Abstract

We consider the problem of optimizing the sum of a smooth convex function and a 
non-smooth convex function using proximal-gradient methods, where an error is present 
in the calculation of the gradient of the smooth term or in the proximity operator with 
respect to the non-smooth term. We show that both the basic proximal-gradient method 
and the accelerated proximal-gradient method achieve the same convergence rate as in 
the error-free case, provided that the errors decrease at appropriate rates. Using these
rates, we perform as well as or better than a carefully chosen fixed error level on a set 
of structured sparsity problems. 

1 Introduction 

In recent years the importance of taking advantage of the structure of convex optimization 
problems has become a topic of intense research in the machine learning community. This 
is particularly true of techniques for non-smooth optimization, where taking advantage of 
the structure of non-smooth terms seems to be crucial to obtaining good performance. 
Proximal-gradient methods and accelerated proximal-gradient methods [1, 2] are among
the most important methods for taking advantage of the structure of many of the
non-smooth optimization problems that arise in practice. In particular, these methods address
composite optimization problems of the form 

minimize f(x):=g(x) + h(x), (1) 

where g and h are convex functions but only g is smooth. One of the most well-studied 
instances of this type of problem is $\ell_1$-regularized least squares [3, 4],

$$\text{minimize} \quad \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1,$$

where we use $\|\cdot\|$ to denote the standard $\ell_2$-norm.

Proximal-gradient methods are an appealing approach for solving these types of non-smooth 
optimization problems because of their fast theoretical convergence rates and strong practical
performance. While classical subgradient methods only achieve an error level on the
objective function of $O(1/\sqrt{k})$ after k iterations, proximal-gradient methods have an error
of $O(1/k)$, while accelerated proximal-gradient methods further reduce this to $O(1/k^2)$ [1, 2].
That is, accelerated proximal-gradient methods for non-smooth convex optimization achieve 
the same optimal convergence rate that accelerated gradient methods achieve for smooth 
optimization. 

Each iteration of a proximal-gradient method requires the calculation of the proximity 
operator, 

$$\mathrm{prox}_L(y) = \arg\min_{x \in \mathbb{R}^d} \; \frac{L}{2}\|x - y\|^2 + h(x), \tag{2}$$

where L is the Lipschitz constant of the gradient of g. We can efficiently compute an
analytic solution to this problem for several notable choices of h, including the case of
$\ell_1$-regularization and disjoint group $\ell_1$-regularization [5, 6]. However, in many scenarios
the proximity operator may not have an analytic solution, or it may be very expensive to
compute this solution exactly. This includes important problems such as total-variation
regularization and its generalizations like the graph-guided fused-LASSO [7, 8],
nuclear-norm regularization and other regularizers on the singular values of matrices [9, 10], and
different formulations of overlapping group $\ell_1$-regularization with general groups [11, 12].
Despite the difficulty in computing the exact proximity operator for these regularizers,
efficient methods have been developed to compute approximate proximity operators in all
of these cases: accelerated projected gradient and Newton-like methods that work with
a smooth dual problem have been used to compute approximate proximity operators in
the context of total-variation regularization [13], Krylov subspace methods and low-rank
representations have been used to compute approximate proximity operators in the context
of nuclear-norm regularization [9, 10], and variants of Dykstra's algorithm (and related dual
methods) have been used to compute approximate proximity operators in the context of
overlapping group $\ell_1$-regularization [12, 14, 15].
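
To make the analytic cases concrete, the following is a minimal sketch of the proximity operator (2) for $h(x) = \lambda\|x\|_1$ and for a single group penalty $h(x) = \lambda\|x\|_2$. It assumes NumPy, and the function names `prox_l1` and `prox_group` are hypothetical; it is an illustration rather than code from the cited works.

```python
import numpy as np

def prox_l1(y, lam, L):
    """argmin_x (L/2)*||x - y||^2 + lam*||x||_1, solved by soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / L, 0.0)

def prox_group(y, lam, L):
    """argmin_x (L/2)*||x - y||^2 + lam*||x||_2 for one group of variables."""
    norm = np.linalg.norm(y)
    if norm <= lam / L:
        return np.zeros_like(y)          # the whole group is set to zero
    return (1.0 - lam / (L * norm)) * y  # otherwise shrink toward the origin

y = np.array([3.0, -0.5, 1.0])
x_l1 = prox_l1(y, lam=1.0, L=1.0)        # entries with |y_i| <= lam/L become zero
x_group = prox_group(y, lam=1.0, L=1.0)  # all entries scaled by a common factor
```

For disjoint groups, `prox_group` would be applied to each group separately; no such closed form is available once the groups overlap, which is the situation the inexact analysis targets.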

It is known that proximal-gradient methods that use an approximate proximity operator
converge under only weak assumptions [16, 17]; we briefly review this and other related
work in the next section. However, despite the many recent works showing impressive
empirical performance of (accelerated) proximal-gradient methods that use an approximate
proximity operator [9, 10, 13, 14, 15], until recently there was no theoretical analysis
of how the error in the calculation of the proximity operator affects the convergence rate
of proximal-gradient methods. In this work, we show in several contexts that, provided
the error in the proximity operator calculation is controlled in an appropriate way, inexact
proximal-gradient strategies achieve the same convergence rates as the corresponding exact
methods. In particular, in Section 4 we first consider convex objectives and analyze the
inexact proximal-gradient (Proposition 1) and accelerated proximal-gradient (Proposition 2)
methods. We then analyze these two algorithms for strongly convex objectives
(Proposition 3 and Proposition 4). Note that, in these analyses, we also consider the possibility
that there is an error in the calculation of the gradient of g. We then present an
experimental comparison of various inexact proximal-gradient strategies in the context of solving
a structured sparsity problem (Section 5).

2 Related Work 

The algorithm we shall focus on in this paper is the proximal-gradient method 

$$x_k = \mathrm{prox}_L\left[ y_{k-1} - \frac{1}{L}\left( g'(y_{k-1}) + e_k \right) \right], \tag{3}$$

where $e_k$ is the error in the calculation of the gradient and the proximity problem (2) is
solved inexactly so that $x_k$ has an error of $\varepsilon_k$ in terms of the proximal objective function (2).
In the basic proximal-gradient method we choose $y_k = x_k$, while in the accelerated
proximal-gradient method we choose

$$y_k = x_k + \beta_k (x_k - x_{k-1}),$$

where the sequence $\{\beta_k\}$ is chosen to accelerate the convergence rate.
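
The recursion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `inexact_prox_grad`, the toy $\ell_1$-regularized least-squares data, and the deterministic gradient error of size $1/k^3$ are all made up for the example, the proximity operator is computed exactly by soft-thresholding, and the extrapolation weight $(k-2)/(k+1)$ is one standard choice of $\beta_{k-1}$.

```python
import numpy as np

def inexact_prox_grad(grad, approx_prox, x0, L, n_iter, accelerate=False):
    """Run recursion (3): x_k = approx_prox(y_{k-1} - grad(y_{k-1}, k) / L, k).

    grad(y, k) may return g'(y) + e_k, and approx_prox(z, k) may return an
    approximate minimizer of (L/2)*||x - z||^2 + h(x).
    """
    x_prev, x = x0.copy(), x0.copy()
    for k in range(1, n_iter + 1):
        # y_{k-1} equals x_{k-1} in the basic method and is extrapolated with
        # beta_{k-1} = (k - 2) / (k + 1) in the accelerated method (at k = 1
        # the extrapolation has no effect because x == x_prev).
        beta = (k - 2) / (k + 1) if accelerate else 0.0
        y = x + beta * (x - x_prev)
        x_prev, x = x, approx_prox(y - grad(y, k) / L, k)
    return x

# Toy l1-regularized least-squares problem (made-up data).
A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
lam = 0.1
L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient of g

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def exact_grad(y, k):
    return A.T @ (A @ y - b)

def noisy_grad(y, k):
    return exact_grad(y, k) + np.ones(2) / k**3  # error e_k of size O(1/k^3)

def soft_threshold(z, k):
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

x0 = np.zeros(2)
x_ref = inexact_prox_grad(exact_grad, soft_threshold, x0, L, 2000)
x_basic = inexact_prox_grad(noisy_grad, soft_threshold, x0, L, 300)
x_accel = inexact_prox_grad(noisy_grad, soft_threshold, x0, L, 300, accelerate=True)
print(f(x_basic) - f(x_ref), f(x_accel) - f(x_ref))
```

Passing the iteration counter `k` to both callables is a design choice that lets the caller tighten the gradient error and the proximal accuracy as the iterations proceed.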

There is a substantial amount of work on methods that use an exact proximity operator but
have an error in the gradient calculation, corresponding to the special case where $\varepsilon_k = 0$
but $e_k$ is non-zero. For example, when the $e_k$ are independent, zero-mean, and
finite-variance random variables, then proximal-gradient methods achieve the (optimal) error
level of $O(1/\sqrt{k})$ [18, 19]. This is different from the scenario we analyze in this paper since
we assume neither unbiased nor independent errors, but instead consider a sequence of errors
converging to 0. This leads to faster convergence rates and makes our analysis applicable
to the case of deterministic and even adversarial errors.

Several authors have recently analyzed the case of a fixed deterministic error in the gradient, 
and shown that accelerated gradient methods achieve the optimal convergence rate up to 
some accuracy that depends on the fixed error level [20, 21, 22], while the earlier work of [23]
analyzes the gradient method in the context of a fixed error level. This contrasts with our 
analysis where by allowing the error to change at every iteration we can achieve convergence 
to the optimal solution. Also, we can tolerate a large error in early iterations when we are 
far from the solution, which may lead to substantial computational gains. Other authors 
have analyzed the convergence rate of the gradient and projected-gradient methods with 
a decreasing sequence of errors [24, 25], but this analysis does not consider the important
class of accelerated gradient methods. In contrast, the analysis of [22] allows a decreasing 
sequence of errors (though convergence rates in this context are not explicitly mentioned) 
and considers the accelerated projected-gradient method. However, the authors of this 
work only consider the case of an exact projection step and they assume the availability 
of an oracle that yields global lower and upper bounds on the function. This non-intuitive 
oracle leads to a novel analysis of smoothing methods, but also to slower convergence rates 
than proximal-gradient methods. The analysis of [21] considers errors in both the gradient
and projection operators for accelerated projected-gradient methods but requires that the 
domain of the function is compact. None of these works consider proximal-gradient methods. 

In the context of proximal-point algorithms, there is a substantial literature on using inexact
proximity operators with a decreasing sequence of errors, dating back to the seminal work of
Rockafellar [26]. Accelerated proximal-point methods with a decreasing sequence of errors
have also been examined, beginning with [27]. However, unlike proximal-gradient methods 
where the proximity operator is only computed with respect to the non-smooth function 
h, proximal-point methods require the calculation of the proximity operator with respect 
to the full objective function. In the context of composite optimization problems of the 
form (1), this requires the calculation of the proximity operator with respect to g + h.
Since it ignores the structure of the problem, this proximity operator may be as difficult to 
compute (even approximately) as the minimizer of the original problem. 

Convergence of inexact proximal-gradient methods can be established with only weak
assumptions on the method used to approximately solve (2). For example, we can establish
that inexact proximal-gradient methods converge under some closedness assumptions on
the mapping induced by the approximate proximity operator, and the assumption that the
algorithm used to compute the inexact proximity operator achieves sufficient descent on
problem (2) compared to the previous iteration $x_{k-1}$ [16]. Convergence of inexact
proximal-gradient methods can also be established under the assumption that the norms of the errors
are summable [17]. However, these prior works did not consider the rate of convergence
of inexact proximal-gradient methods, nor did they consider accelerated proximal-gradient
methods. Indeed, the authors of [7] chose to use the non-accelerated variant of the
proximal-gradient algorithm since even convergence of the accelerated proximal-gradient method had
not been established under an inexact proximity operator.

While preparing the final version of this work, [28] independently gave an analysis of the
accelerated proximal-gradient method with an inexact proximity operator and a decreasing
sequence of errors (assuming an exact gradient). Further, their analysis leads to a weaker
dependence on the errors than in our Proposition 2. However, while we only assume that
the proximal problem can be solved up to a certain accuracy, they make the much stronger
assumption that the inexact proximity operator yields an $\varepsilon_k$-subdifferential of h
[28, Definition 2.1]. Our analysis can be modified to give an improved dependence on the errors under
this stronger assumption. In particular, the terms in $\sqrt{\varepsilon_i}$ disappear from the expressions of
$A_k$, $\tilde{A}_k$ and $\hat{A}_k$. In the case of Propositions 1 and 2, this leads to the optimal convergence
rate with a slower decay of $\varepsilon_i$. More details may be found after Lemma 2 in the
Appendix. More recently, [29] gave an alternative analysis of an accelerated proximal-gradient
method with an inexact proximity operator and a decreasing sequence of errors (assuming
an exact gradient), but under a non-intuitive assumption on the relationship between the
approximate solution of the proximal problem and the $\varepsilon_k$-subdifferential of h.



3 Notation and Assumptions 

In this work, we assume that the smooth function g in (1) is convex and differentiable, and
that its gradient g' is Lipschitz-continuous with constant L, meaning that for all x and y in
$\mathbb{R}^d$ we have

$$\|g'(x) - g'(y)\| \leq L\|x - y\|.$$

This is a standard assumption in differentiable optimization, see [30, §2.1.1]. If g is
twice-differentiable, this corresponds to the assumption that the eigenvalues of its Hessian are
bounded above by L. In Propositions 3 and 4 only, we will also assume that g is $\mu$-strongly
convex (see [30, §2.1.3]), meaning that for all x and y in $\mathbb{R}^d$ we have

$$g(y) \geq g(x) + \langle g'(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2.$$

However, apart from Propositions 3 and 4, we only assume that this holds with $\mu = 0$,
which is equivalent to convexity of g.

In contrast to these assumptions on g, we will only assume that h in (1) is a lower
semi-continuous proper convex function (see [31, §1.2]), but will not assume that h is differentiable
or Lipschitz-continuous. This allows h to be any real-valued convex function, but also allows
for the possibility that h is an extended real-valued convex function. For example, h could 
be the indicator function of a convex set, and in this case the proximity operator becomes 
the projection operator. 
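
As a small illustration of the indicator-function case, the sketch below takes h to be the indicator of a box, a choice made only for this example; the name `prox_box_indicator` is hypothetical and NumPy is assumed.

```python
import numpy as np

def prox_box_indicator(y, lower, upper):
    """Proximity operator of the indicator of {x : lower <= x <= upper}.

    The indicator is 0 inside the box and +infinity outside, so minimizing
    (L/2)*||x - y||^2 + h(x) is the Euclidean projection of y onto the box,
    whatever the value of L > 0.
    """
    return np.clip(y, lower, upper)

y = np.array([-2.0, 0.3, 5.0])
x = prox_box_indicator(y, lower=0.0, upper=1.0)
```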

We will use $x_k$ to denote the parameter vector at iteration k, and $x^*$ to denote a minimizer
of f. We assume that such an $x^*$ exists, but do not assume that it is unique. We use $e_k$ to
denote the error in the calculation of the gradient at iteration k, and we use $\varepsilon_k$ to denote
the error in the proximal objective function achieved by $x_k$, meaning that

$$\frac{L}{2}\|x_k - y\|^2 + h(x_k) \leq \varepsilon_k + \min_{x \in \mathbb{R}^d} \left\{ \frac{L}{2}\|x - y\|^2 + h(x) \right\}, \tag{4}$$

where $y = y_{k-1} - (1/L)(g'(y_{k-1}) + e_k)$. Note that the proximal optimization problem (2)
is strongly convex and in practice we are often able to obtain such bounds via a duality gap
(e.g., see [12] for the case of overlapping group $\ell_1$-regularization).
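
Criterion (4) can be checked directly whenever the exact proximal solution happens to be known. The sketch below does this for $h(x) = \lambda\|x\|_1$, where soft-thresholding gives the exact minimizer; the function names and the numbers are made up for illustration, and in the settings of interest the exact solution is unavailable, so a duality gap would play the role of `eps`.

```python
import numpy as np

def prox_objective(x, y, L, lam):
    """Proximal objective (L/2)*||x - y||^2 + lam*||x||_1."""
    return 0.5 * L * np.sum((x - y) ** 2) + lam * np.sum(np.abs(x))

def prox_error(x, y, L, lam):
    """Smallest eps for which x satisfies (4), using the exact l1 solution."""
    x_exact = np.sign(y) * np.maximum(np.abs(y) - lam / L, 0.0)
    eps = prox_objective(x, y, L, lam) - prox_objective(x_exact, y, L, lam)
    return eps, x_exact

y = np.array([3.0, -0.5, 1.5])
L, lam = 2.0, 1.0
x_approx = np.array([2.4, 0.05, 1.1])  # a made-up inexact solution
eps, x_exact = prox_error(x_approx, y, L, lam)
# Strong convexity of the proximal objective gives ||x - x_exact|| <= sqrt(2*eps/L).
dist = np.linalg.norm(x_approx - x_exact)
print(eps, dist, np.sqrt(2.0 * eps / L))
```

The last comparison reflects why quantities of the form $\sqrt{2\varepsilon_k/L}$ appear in the analysis: an $\varepsilon_k$-accurate proximal objective value controls the distance to the exact proximal point only through a square root.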



4 Convergence Rates of Inexact Proximal-Gradient Methods 

In this section we present the analysis of the convergence rates of inexact proximal-gradient
methods as a function of the sequences of solution accuracies to the proximal problems
$\{\varepsilon_k\}$ and the sequences of magnitudes of the errors in the gradient calculations $\{\|e_k\|\}$. We shall
use (H) to denote the set of four assumptions which will be made for each proposition:

• g is convex and has L-Lipschitz-continuous gradient; 

• h is a lower semi-continuous proper convex function; 

• The function f = g + h attains its minimum at a certain $x^* \in \mathbb{R}^n$;

• $x_k$ is an $\varepsilon_k$-optimal solution to the proximal problem (2) in the sense of (4).

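We first consider the basic proximal-gradient method in the convex case:

Proposition 1 (Basic proximal-gradient method - Convexity) Assume (H) and that
we iterate recursion (3) with $y_k = x_k$. Then, for all $k \geq 1$, we have

$$f\left( \frac{1}{k}\sum_{i=1}^{k} x_i \right) - f(x^*) \leq \frac{L}{2k}\left( \|x_0 - x^*\| + 2A_k + \sqrt{2B_k} \right)^2, \tag{5}$$

with

$$A_k = \sum_{i=1}^{k} \left( \frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}} \right), \qquad B_k = \sum_{i=1}^{k} \frac{\varepsilon_i}{L}.$$
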
The proof may be found in the Appendix. Note that while we have stated the proposition
in terms of the function value achieved by the average of the iterates, it trivially also holds
for the iterate that achieves the lowest function value. This result implies that the
well-known $O(1/k)$ convergence rate for the gradient method without errors still holds when
both $\{\|e_k\|\}$ and $\{\sqrt{\varepsilon_k}\}$ are summable. A sufficient condition to achieve this is for $\|e_k\|$ and
$\sqrt{\varepsilon_k}$ to decrease as $O(1/k^{1+\delta})$ for any $\delta > 0$. Note that a faster convergence of these two
errors will not improve the convergence rate but will yield a better constant factor.
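
The role of summability can be seen numerically. The sketch below evaluates a bound of the form $(L/2k)(\|x_0 - x^*\| + 2A_k + \sqrt{2B_k})^2$, assuming the partial sums are $A_k = \sum_i (\|e_i\|/L + \sqrt{2\varepsilon_i/L})$ and $B_k = \sum_i \varepsilon_i/L$; the function name `prop1_bound` and the constants $L = 1$, $\|x_0 - x^*\| = 1$ are arbitrary choices for illustration.

```python
import numpy as np

def prop1_bound(err_norms, eps, L, dist0):
    """Bound (L/2k)*(dist0 + 2*A_k + sqrt(2*B_k))^2 for k = 1, ..., len(eps),
    with the assumed definitions A_k = sum_i (||e_i||/L + sqrt(2*eps_i/L))
    and B_k = sum_i eps_i/L."""
    A = np.cumsum(err_norms / L + np.sqrt(2.0 * eps / L))
    B = np.cumsum(eps / L)
    k = np.arange(1, len(eps) + 1)
    return L / (2.0 * k) * (dist0 + 2.0 * A + np.sqrt(2.0 * B)) ** 2, A, B

k = np.arange(1, 10001, dtype=float)
L, dist0 = 1.0, 1.0
# Summable errors: ||e_k|| = sqrt(eps_k) = 1/k^2, so A_k stays bounded.
fast, A_fast, _ = prop1_bound(1 / k**2, 1 / k**4, L, dist0)
# Non-summable errors: ||e_k|| = sqrt(eps_k) = 1/k, so A_k keeps growing.
slow, A_slow, _ = prop1_bound(1 / k, 1 / k**2, L, dist0)
print(A_fast[-1], A_slow[-1])              # bounded versus slowly growing
print(k[-1] * fast[-1], k[-1] * slow[-1])  # k times the bound in each case
```

In the summable case, k times the bound levels off, which is the $O(1/k)$ rate; in the second case it keeps increasing with k.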



It is interesting to consider what happens if $\{\|e_k\|\}$ or $\{\sqrt{\varepsilon_k}\}$ is not summable. For instance,
if $\|e_k\|$ and $\sqrt{\varepsilon_k}$ decrease as $O(1/k)$, then $A_k$ grows as $O(\log k)$ (note that $B_k$ is always
smaller than $A_k$) and the convergence of the function values is in $O(\log^2 k / k)$. Finally, a
necessary condition to obtain convergence is that the partial sums $A_k$ and $B_k$ need to be
in $o(\sqrt{k})$.

We now turn to the case of an accelerated proximal-gradient method. We focus on a basic
variant of the algorithm where $\beta_k$ is set to $(k-1)/(k+2)$ [32, Eq. (19) and (27)]:

Proposition 2 (Accelerated proximal-gradient method - Convexity) Assume (H)
and that we iterate recursion (3) with $y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1})$. Then, for all $k \geq 1$, we
have

$$f(x_k) - f(x^*) \leq \frac{2L}{(k+1)^2}\left( \|x_0 - x^*\| + 2\tilde{A}_k + \sqrt{2\tilde{B}_k} \right)^2, \tag{6}$$

with

$$\tilde{A}_k = \sum_{i=1}^{k} i\left( \frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}} \right), \qquad \tilde{B}_k = \sum_{i=1}^{k} \frac{i^2 \varepsilon_i}{L}.$$

In this case, we require the series $\{k\|e_k\|\}$ and $\{k\sqrt{\varepsilon_k}\}$ to be summable to achieve the
optimal $O(1/k^2)$ rate, which is an (unsurprisingly) stronger constraint than in the basic
case. A sufficient condition is for $\|e_k\|$ and $\sqrt{\varepsilon_k}$ to decrease as $O(1/k^{2+\delta})$ for any $\delta > 0$.
Note that, as opposed to Proposition 1 that is stated for the average iterate, this bound is
for the last iterate $x_k$.

Again, it is interesting to see what happens when the summability assumption is not met.
First, if $\|e_k\|$ or $\sqrt{\varepsilon_k}$ decreases at a rate of $O(1/k^2)$, then $k(\|e_k\| + \sqrt{\varepsilon_k})$ decreases as $O(1/k)$
and $\tilde{A}_k$ grows as $O(\log k)$ (note that $\tilde{B}_k$ is always smaller than $\tilde{A}_k$), yielding a convergence
rate of $O(\log^2 k / k^2)$ for $f(x_k) - f(x^*)$. Also, and perhaps more interestingly, if $\|e_k\|$ or $\sqrt{\varepsilon_k}$
decreases at a rate of $O(1/k)$, Eq. (6) does not guarantee convergence of the function values.
More generally, the form of $\tilde{A}_k$ and $\tilde{B}_k$ indicates that errors have a greater effect on the
accelerated method than on the basic method. Hence, as also discussed in [22], unlike in
the error-free case, the accelerated method may not necessarily be better than the basic
method because it is more sensitive to errors in the computation.

In the case where g is strongly convex it is possible to obtain linear convergence rates that
depend on the ratio

$$\gamma = \mu / L,$$

as opposed to the sublinear convergence rates discussed above. In particular, we obtain the
following convergence rate on the iterates of the basic proximal-gradient method:

Proposition 3 (Basic proximal-gradient method - Strong convexity) Assume (H),
that g is $\mu$-strongly convex, and that we iterate recursion (3) with $y_k = x_k$. Then, for all
$k \geq 1$, we have:

$$\|x_k - x^*\| \leq (1-\gamma)^k \left( \|x_0 - x^*\| + \bar{A}_k \right), \tag{7}$$

with

$$\bar{A}_k = \sum_{i=1}^{k} (1-\gamma)^{-i}\left( \frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}} \right).$$

A consequence of this proposition is that we obtain a linear rate of convergence even in
the presence of errors, provided that $\|e_k\|$ and $\sqrt{\varepsilon_k}$ decrease linearly to 0. If they do so
at a rate of $Q' < (1-\gamma)$, then the convergence rate of $\|x_k - x^*\|$ is linear with constant
$(1-\gamma)$, as in the error-free algorithm. If we have $Q' > (1-\gamma)$, then the convergence of
$\|x_k - x^*\|$ is linear with constant $Q'$. If we have $Q' = (1-\gamma)$, then $\|x_k - x^*\|$ converges to 0
as $O(k(1-\gamma)^k) = o\left([(1-\gamma) + \delta']^k\right)$ for all $\delta' > 0$.
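
The first two regimes can be illustrated numerically. The sketch below evaluates a bound of the form $(1-\gamma)^k(\|x_0 - x^*\| + \bar{A}_k)$ for gradient errors $\|e_i\| = c\,\rho^i$ and exact proximal steps, assuming $\bar{A}_k = \sum_i (1-\gamma)^{-i}\|e_i\|/L$; the name `prop3_bound` and all constants are arbitrary choices for illustration.

```python
import numpy as np

def prop3_bound(gamma, rho, c, L, dist0, n_iter):
    """Bound (1-gamma)^k * (dist0 + A_k) when ||e_i|| = c*rho**i and eps_i = 0,
    with the assumed form A_k = sum_i (1 - gamma)**(-i) * ||e_i|| / L."""
    i = np.arange(1, n_iter + 1)
    A = np.cumsum((1.0 - gamma) ** (-i) * c * rho**i / L)
    return (1.0 - gamma) ** i * (dist0 + A)

gamma = 0.1
ratios = {}
for rho in (0.5, 0.95):
    bound = prop3_bound(gamma, rho, c=1.0, L=1.0, dist0=1.0, n_iter=200)
    ratios[rho] = bound[-1] / bound[-2]  # contraction factor of the bound at k = 200
print(ratios)
```

When the errors decay faster than $(1-\gamma)$ the bound contracts at the error-free factor, and when they decay more slowly the bound contracts at the error rate instead.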




Finally, we consider the accelerated proximal-gradient algorithm when g is strongly convex.
We focus on a basic variant of the algorithm where $\beta_k$ is set to $(1-\sqrt{\gamma})/(1+\sqrt{\gamma})$
[30, §2.2.1]:

Proposition 4 (Accelerated proximal-gradient method - Strong convexity) Assume
(H), that g is $\mu$-strongly convex, and that we iterate recursion (3) with
$y_k = x_k + \frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}(x_k - x_{k-1})$. Then, for all $k \geq 1$, we have

$$f(x_k) - f(x^*) \leq (1-\sqrt{\gamma})^k \left( \sqrt{2(f(x_0) - f(x^*))} + \hat{A}_k\sqrt{\frac{2}{\mu}} + \sqrt{\hat{B}_k} \right)^2, \tag{8}$$

with

$$\hat{A}_k = \sum_{i=1}^{k} \left( \|e_i\| + \sqrt{2L\varepsilon_i} \right)(1-\sqrt{\gamma})^{-i/2}, \qquad \hat{B}_k = \sum_{i=1}^{k} \varepsilon_i (1-\sqrt{\gamma})^{-i}.$$

Note that while we have stated the result in terms of function values, we obtain an analogous
result on the iterates because by strong convexity of f we have

$$\frac{\mu}{2}\|x_k - x^*\|^2 \leq f(x_k) - f(x^*).$$

This proposition implies that we obtain a linear rate of convergence in the presence of errors
provided that $\|e_k\|^2$ and $\varepsilon_k$ decrease linearly to 0. If they do so at a rate $Q' < (1-\sqrt{\gamma})$,
then the constant is $(1-\sqrt{\gamma})$, while if $Q' > (1-\sqrt{\gamma})$ then the constant will be $Q'$. Thus, the
accelerated inexact proximal-gradient method will have a faster convergence rate than the
exact basic proximal-gradient method provided that $Q' < (1-\gamma)$. Oddly, in our analysis of
the strongly convex case, the accelerated method is less sensitive to errors than the basic
method. However, unlike the basic method, the accelerated method requires knowing $\mu$ in
addition to L. If $\mu$ is misspecified, then the convergence rate of the accelerated method
may be slower than the basic method.



5 Experiments 

We tested the basic inexact proximal-gradient and accelerated proximal-gradient methods
on the CUR-like factorization optimization problem introduced in [33] to approximate a
given matrix W,

$$\min_X \; \frac{1}{2}\|W - WXW\|_F^2 + \lambda_{\text{row}} \sum_{i=1}^{n_r} \|X^i\|_p + \lambda_{\text{col}} \sum_{j=1}^{n_c} \|X_j\|_p.$$

Under an appropriate choice of p, this optimization problem yields a matrix X with sparse 
rows and sparse columns, meaning that entire rows and columns of the matrix X are set to 
exactly zero. In [33], the authors used an accelerated proximal-gradient method and chose
$p = \infty$ since under this choice the proximity operator can be computed exactly. However,
this has the undesirable effect that it also encourages all values in the same row (or column) 
to have the same magnitude. The more natural choice of p = 2 was not explored since in 
this case there is no known algorithm to exactly compute the proximity operator. 

Our experiments focused on the case of p = 2. In this case, it is possible to very quickly
compute an approximate proximity operator using the block coordinate descent (BCD)
algorithm presented in [12], which is equivalent to the proximal variant of Dykstra's
algorithm introduced by [34]. In our implementation of the BCD method, we alternate between
computing the proximity operator with respect to the rows and to the columns. Since the
BCD method allows us to compute a duality gap when solving the proximal problem, we
can run the method until the duality gap is below a given error threshold $\varepsilon_k$ to find an $x_{k+1}$
satisfying (4).

In our experiments, we used the four data sets examined by [33] (the data sets are freely
available at http://www.gems-system.org) and we chose $\lambda_{\text{row}} = .01$
and $\lambda_{\text{col}} = .01$, which yielded approximately 25-40% non-zero entries in X (depending
on the data set). Rather than assuming we are given the Lipschitz constant L, on the
first iteration we set L to 1 and, following [2], we double our estimate any time
$g(x_k) > g(y_{k-1}) + \langle g'(y_{k-1}), x_k - y_{k-1} \rangle + (L/2)\|x_k - y_{k-1}\|^2$. We tested three different ways to
terminate the approximate proximal problem, each parameterized by a parameter $\alpha$:

• $\varepsilon_k = 1/k^\alpha$: Running the BCD algorithm until the duality gap is below $1/k^\alpha$.

• $\varepsilon_k = \alpha$: Running the BCD algorithm until the duality gap is below $\alpha$.

• $n = \alpha$: Running the BCD algorithm for a fixed number of iterations $\alpha$.
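
The three termination rules can be written as a small wrapper around any inner solver that exposes a duality gap. This is a hypothetical sketch rather than the implementation used in the experiments: the callables `step` and `gap`, the strategy labels, and the toy inner solver whose gap halves at each step are all assumptions made for illustration.

```python
def inner_tolerance(strategy, alpha, k):
    """Duality-gap threshold for outer iteration k (None means no threshold)."""
    if strategy == "decreasing":   # eps_k = 1 / k^alpha
        return 1.0 / k**alpha
    if strategy == "fixed":        # eps_k = alpha
        return alpha
    if strategy == "iterations":   # n = alpha inner iterations, no threshold
        return None
    raise ValueError(f"unknown strategy: {strategy}")

def run_inner_solver(step, gap, state, strategy, alpha, k, max_iter=10_000):
    """Apply `step` until the chosen stopping rule is met; return (state, n_steps)."""
    tol = inner_tolerance(strategy, alpha, k)
    limit = alpha if tol is None else max_iter
    n_steps = 0
    while n_steps < limit and (tol is None or gap(state) > tol):
        state = step(state)
        n_steps += 1
    return state, n_steps

# Toy stand-in for BCD: the "state" is the gap itself and each step halves it.
halve = lambda s: s / 2.0
identity = lambda s: s
print(run_inner_solver(halve, identity, 1.0, "decreasing", 3, k=10))
print(run_inner_solver(halve, identity, 1.0, "fixed", 1e-6, k=10))
print(run_inner_solver(halve, identity, 1.0, "iterations", 3, k=10))
```

Only the first rule ties the inner accuracy to the outer iteration counter, which is the property the convergence analysis relies on.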

Note that all three strategies lead to global convergence in the case of the basic
proximal-gradient method, the first two give a convergence rate up to some fixed optimality tolerance,
and in this paper we have shown that the first one (for large enough $\alpha$) yields a convergence
rate for an arbitrary optimality tolerance. Note that the iterates produced by the BCD
iterations are sparse, so we expected the algorithms to spend the majority of their time
solving the proximity problem. Thus, we used the function value against the number of
BCD iterations as a measure of performance. We plot the results after 500 BCD iterations
for the four data sets for the proximal-gradient method in Figure 1 and the accelerated
proximal-gradient method in Figure 2. In these plots, the first column varies $\alpha$ using the
choice $\varepsilon_k = 1/k^\alpha$, the second column varies $\alpha$ using the choice $\varepsilon_k = \alpha$, and the third column
varies $\alpha$ using the choice $n = \alpha$. We also include one of the best methods from the first
column in the second and third columns as a reference.

In the context of proximal-gradient methods the choice of $\varepsilon_k = 1/k^3$, which is one choice that
achieves the fastest convergence rate according to our analysis, gives the best performance
across all four data sets. However, in these plots we also see that reasonable performance
can be achieved by any of the three strategies above provided that $\alpha$ is chosen carefully. For
example, choosing $n = 3$ or choosing $\varepsilon_k = 10^{-6}$ both give reasonable performance. However,
these are only empirical observations for these data sets and they may be ineffective for
other data sets or if we change the number of iterations, while we have given theoretical
justification for the choice $\varepsilon_k = 1/k^3$.

Similar trends are observed for the case of accelerated proximal-gradient methods, though
the choice of $\varepsilon_k = 1/k^3$ (which no longer achieves the fastest convergence rate according
to our analysis) no longer dominates the other methods in the accelerated setting. For
the SRBCT data set the choice $\varepsilon_k = 1/k^4$, which is a choice that achieves the fastest
convergence rate up to a poly-logarithmic factor, yields better performance than $\varepsilon_k = 1/k^3$.
Interestingly, the only choice that yields the fastest possible convergence rate ($\varepsilon_k = 1/k^5$)
had reasonable performance but did not give the best performance on any data set. This
seems to reflect the trade-off between performing inner BCD iterations to achieve a small
duality gap and performing outer gradient iterations to decrease the value of f. Also, the
constant terms which were not taken into account in the analysis do play an important role
here, due to the relatively small number of outer iterations performed.

6 Discussion 

Alternatives to inexact proximal methods for solving structured sparsity problems include
smoothing methods [35] and alternating direction methods [36]. However, a major
disadvantage of both these approaches is that the iterates are not sparse, so they cannot
take advantage of the sparsity of the problem when running the algorithm. In contrast,
the method proposed in this paper has the appealing property that it tends to generate
sparse iterates. Further, the accelerated smoothing method only has a convergence rate of
$O(1/k)$, and the performance of alternating direction methods is often sensitive to the exact
choice of their penalty parameter. On the other hand, while our analysis suggests using
a sequence of errors like $O(1/k^\alpha)$ for $\alpha$ large enough, the practical performance of inexact
proximal-gradient methods will be sensitive to the exact choice of this sequence.

Figure 1: Objective function against number of proximal iterations for the proximal-gradient
method with different strategies for terminating the approximate proximity calculation.
From top to bottom we have the 9_Tumors, Brain_Tumor1, Leukemia1, and SRBCT data
sets.

Figure 2: Objective function against number of proximal iterations for the accelerated
proximal-gradient method with different strategies for terminating the approximate
proximity calculation. From top to bottom we have the 9_Tumors, Brain_Tumor1, Leukemia1,
and SRBCT data sets.

Although we have illustrated the use of our results in the context of a structured sparsity
problem, inexact proximal-gradient methods are also used in other applications such as
total-variation [8] and nuclear-norm [10] regularization. This work provides a theoretical
justification for using inexact proximal-gradient methods in these and other applications,
and suggests some guidelines for practitioners who do not want to lose the appealing
convergence rates of these methods. Further, although our experiments and much of our discussion
focus on errors in the calculation of the proximity operator, our analysis also allows for an
error in the calculation of the gradient. This may also be useful in a variety of contexts. For
example, errors in the calculation of the gradient arise when fitting undirected graphical
models and using an iterative method to approximate the gradient of the log-partition
function [37]. Other examples include using a reduced set of training examples within kernel
methods [38] or subsampling to solve semidefinite programming problems [39].

In our analysis, we assume that the smoothness constant L is known, but it would be
interesting to extend methods for estimating L in the exact case [2] to the case of inexact
algorithms. In the context of accelerated methods for strongly convex optimization, our
analysis also assumes that $\mu$ is known, and it would be interesting to explore variants that
do not make this assumption. We also note that if the basic proximal-gradient method is
given knowledge of $\mu$, then our analysis can be modified to obtain a faster linear convergence
rate of $(1-\gamma)/(1+\gamma)$ instead of $(1-\gamma)$ for strongly-convex optimization using a step size of
$2/(\mu + L)$, see Theorem 2.1.15 of [30]. Finally, we note that there has been recent interest
in inexact proximal Newton-like methods [40], and it would be interesting to analyze the
effect of errors on the convergence rates of these methods.

Appendix: Proofs of the propositions

We first prove a lemma which will be used for the propositions. 

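Lemma 1 Assume that the nonnegative sequence $\{u_k\}$ satisfies the following recursion for
all $k \geq 1$:

$$u_k^2 \leq S_k + \sum_{i=1}^{k} \lambda_i u_i,$$

with $\{S_k\}$ an increasing sequence, $S_0 \geq u_0^2$ and $\lambda_i \geq 0$ for all i. Then, for all $k \geq 1$,

$$u_k \leq \frac{1}{2}\sum_{i=1}^{k} \lambda_i + \left( S_k + \left( \frac{1}{2}\sum_{i=1}^{k} \lambda_i \right)^2 \right)^{1/2}.$$

Proof We prove the result by induction. It is true for k = 0 (by assumption). We assume
it is true for $k - 1$, and we denote by $v_{k-1} = \max\{u_1, \ldots, u_{k-1}\}$. From the recursion, we
thus get

$$(u_k - \lambda_k/2)^2 \leq S_k + \frac{\lambda_k^2}{4} + v_{k-1}\sum_{i=1}^{k-1} \lambda_i,$$

leading to

$$u_k \leq \frac{\lambda_k}{2} + \left( S_k + \frac{\lambda_k^2}{4} + v_{k-1}\sum_{i=1}^{k-1} \lambda_i \right)^{1/2},$$

and thus

$$v_k \leq \max\left\{ v_{k-1}, \; \frac{\lambda_k}{2} + \left( S_k + \frac{\lambda_k^2}{4} + v_{k-1}\sum_{i=1}^{k-1} \lambda_i \right)^{1/2} \right\}.$$

The two terms in the maximum are equal if $v_{k-1}^2 = S_k + v_{k-1}\sum_{i=1}^{k} \lambda_i$, i.e., for

$$v_{k-1}^* = \frac{1}{2}\sum_{i=1}^{k} \lambda_i + \left( S_k + \left( \frac{1}{2}\sum_{i=1}^{k} \lambda_i \right)^2 \right)^{1/2}.$$

If $v_{k-1} \leq v_{k-1}^*$, then $v_k \leq v_{k-1}^*$ since the two terms in the max are increasing
functions of $v_{k-1}$. If $v_{k-1} \geq v_{k-1}^*$, then

$$v_{k-1} \geq \frac{\lambda_k}{2} + \left( S_k + \frac{\lambda_k^2}{4} + v_{k-1}\sum_{i=1}^{k-1} \lambda_i \right)^{1/2}.$$

Hence, $v_k \leq v_{k-1}$, and the induction hypotheses ensure that
the property is satisfied for k.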
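
As a numerical sanity check of a bound of this type, the sketch below builds the largest sequence allowed by the recursion $u_k^2 \leq S_k + \sum_{i \leq k} \lambda_i u_i$ (taking equality at every step) and compares it with $\frac{1}{2}\sum_i \lambda_i + (S_k + (\frac{1}{2}\sum_i \lambda_i)^2)^{1/2}$. The random data and the seed are arbitrary choices for illustration, and the check is of course not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
lam = rng.uniform(0.0, 1.0, size=n)               # lambda_1, ..., lambda_n >= 0
S = np.cumsum(rng.uniform(0.0, 1.0, size=n + 1))  # increasing S_0, ..., S_n
u = [np.sqrt(S[0])]                               # u_0 with u_0^2 <= S_0

ok = True
for k in range(1, n + 1):
    past = sum(lam[i - 1] * u[i] for i in range(1, k))  # sum_{i<k} lambda_i u_i
    # Largest u_k allowed by the recursion: u_k^2 = S_k + lambda_k*u_k + past.
    u_k = lam[k - 1] / 2 + np.sqrt(lam[k - 1] ** 2 / 4 + S[k] + past)
    u.append(u_k)
    half_sum = 0.5 * lam[:k].sum()
    bound = half_sum + np.sqrt(S[k] + half_sum**2)
    ok = ok and bool(u_k <= bound + 1e-9)  # small slack for rounding
print(ok)
```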



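The following lemma will allow us to characterize the elements of the $\varepsilon_k$-subdifferential of
h at $x_k$, $\partial_{\varepsilon_k} h(x_k)$. As a reminder, the $\varepsilon$-subdifferential of a convex function a at x is the
set of vectors y such that $a(t) - a(x) \geq y^\top (t - x) - \varepsilon$ for all t.

Lemma 2 If $x_i$ is an $\varepsilon_i$-optimal solution to the proximal problem (2) in the sense of (4),
then there exists $f_i$ such that $\|f_i\| \leq \sqrt{2\varepsilon_i/L}$ and

$$L\left( y_{i-1} - x_i - \frac{1}{L}\left( g'(y_{i-1}) + e_i \right) - f_i \right) \in \partial_{\varepsilon_i} h(x_i).$$

Proof We first recall some properties of $\varepsilon$-subdifferentials (see, e.g., [41, Section 4.3] for
more details). By definition, x is an $\varepsilon$-minimizer of a convex function a if and only if
$a(x) \leq \inf_{y \in \mathbb{R}^n} a(y) + \varepsilon$. This is equivalent to 0 belonging to the $\varepsilon$-subdifferential $\partial_\varepsilon a(x)$.
If $a = a_1 + a_2$, where both $a_1$ and $a_2$ are convex, we have $\partial_\varepsilon a(x) \subset \partial_\varepsilon a_1(x) + \partial_\varepsilon a_2(x)$.

If $a_1(x) = \frac{L}{2}\|x - z\|^2$, then

$$\partial_\varepsilon a_1(x) = \left\{ y \in \mathbb{R}^n, \; \frac{L}{2}\left\| x - z - \frac{y}{L} \right\|^2 \leq \varepsilon \right\} = \left\{ y \in \mathbb{R}^n, \; y = Lx - Lz + Lf, \; \frac{L}{2}\|f\|^2 \leq \varepsilon \right\}.$$

If $a_2 = h$ and x is an $\varepsilon$-minimizer of $a_1 + a_2$, then 0 belongs to $\partial_\varepsilon a(x)$. Since $\partial_\varepsilon a(x) \subset
\partial_\varepsilon a_1(x) + \partial_\varepsilon a_2(x)$, we have that 0 is the sum of an element of $\partial_\varepsilon a_1(x)$ and of an element of
$\partial_\varepsilon h(x)$. Hence, there is an f such that

$$Lz - Lx - Lf \in \partial_\varepsilon h(x) \quad \text{with} \quad \|f\| \leq \sqrt{\frac{2\varepsilon}{L}}. \tag{9}$$

Using $z = y_{i-1} - (1/L)(g'(y_{i-1}) + e_i)$ and $x = x_i$, this implies that there exists $f_i$ such that
$\|f_i\| \leq \sqrt{2\varepsilon_i/L}$ and

$$L\left( y_{i-1} - x_i - \frac{1}{L}\left( g'(y_{i-1}) + e_i \right) - f_i \right) \in \partial_{\varepsilon_i} h(x_i).$$

In [28, Definition 2.1], Eq. (9) is replaced by $Lz - Lx \in \partial_\varepsilon h(x)$. Hence, their definition of an
approximate solution is equivalent to ours but using f = 0. If we replace $f_i$ by 0 in the
proof of Proposition 2, we get the $O(1/k^2)$ convergence rate using any sequence of errors
$\{\varepsilon_k\}$ necessary to achieve the $O(1/k^2)$ rate in [28, Th. 4.4]. We can also make the same
assumption on f in Proposition 1 to achieve the optimal convergence rate with a decay of
$\sqrt{\varepsilon_k}$ in $O(1/k^{0.5+\delta})$ instead of $O(1/k^{1+\delta})$.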


6.1 Basic proximal-gradient method with errors in the convex case 

We now give the proof of Proposition 1.

Proof Since $x_k$ is an $\varepsilon_k$-optimal solution to the proximal problem (2), in the sense defined earlier, we can use Lemma 2 to yield that there exists $f_k$ such that $\|f_k\| \le \sqrt{2\varepsilon_k/L}$ and

$$L\Big(x_{k-1} - x_k - \frac{1}{L}\big(g'(x_{k-1}) + e_k\big) - f_k\Big) \in \partial_{\varepsilon_k} h(x_k).$$

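We now bound $g(x_i)$ and $h(x_i)$ as follows:

$$\begin{aligned}
g(x_i) &\le g(x_{i-1}) + \langle g'(x_{i-1}), x_i - x_{i-1} \rangle + \frac{L}{2}\|x_i - x_{i-1}\|^2 \\
&\qquad \text{using L-Lipschitz gradient and the convexity of } g, \\
&\le g(x^*) + \langle g'(x_{i-1}), x_{i-1} - x^* \rangle + \langle g'(x_{i-1}), x_i - x_{i-1} \rangle + \frac{L}{2}\|x_i - x_{i-1}\|^2 \\
&\qquad \text{using convexity of } g.
\end{aligned}$$

Using the $\varepsilon_i$-subgradient, we have

$$h(x_i) \le h(x^*) - \langle g'(x_{i-1}) + e_i + L(x_i + f_i - x_{i-1}), x_i - x^* \rangle + \varepsilon_i.$$

Adding the two together, we get:

$$\begin{aligned}
f(x_i) &= g(x_i) + h(x_i) \\
&\le f(x^*) + \frac{L}{2}\|x_i - x_{i-1}\|^2 - L\langle x_i - x_{i-1}, x_i - x^* \rangle + \varepsilon_i - \langle e_i + L f_i, x_i - x^* \rangle \\
&= f(x^*) + \frac{L}{2}\langle x_i - x_{i-1}, x_i - x_{i-1} - 2x_i + 2x^* \rangle + \varepsilon_i - \langle e_i + L f_i, x_i - x^* \rangle \\
&= f(x^*) + \frac{L}{2}\langle x_i - x^* - (x_{i-1} - x^*), (x^* - x_i) + (x^* - x_{i-1}) \rangle + \varepsilon_i - \langle e_i + L f_i, x_i - x^* \rangle \\
&= f(x^*) - \frac{L}{2}\|x_i - x^*\|^2 + \frac{L}{2}\|x_{i-1} - x^*\|^2 + \varepsilon_i - \langle e_i + L f_i, x_i - x^* \rangle \\
&\le f(x^*) - \frac{L}{2}\|x_i - x^*\|^2 + \frac{L}{2}\|x_{i-1} - x^*\|^2 + \varepsilon_i + \big(\|e_i\| + \sqrt{2L\varepsilon_i}\big) \cdot \|x_i - x^*\|
\end{aligned}$$

using Cauchy-Schwarz and $\|f_i\| \le \sqrt{2\varepsilon_i/L}$.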

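Moving $f(x^*)$ on the other side and summing from $i = 1$ to $k$, we get:

$$\sum_{i=1}^k [f(x_i) - f(x^*)] \le -\frac{L}{2}\|x_k - x^*\|^2 + \frac{L}{2}\|x_0 - x^*\|^2 + \sum_{i=1}^k \varepsilon_i + \sum_{i=1}^k \Big[\big(\|e_i\| + \sqrt{2L\varepsilon_i}\big) \cdot \|x_i - x^*\|\Big],$$

i.e.

$$\sum_{i=1}^k [f(x_i) - f(x^*)] + \frac{L}{2}\|x_k - x^*\|^2 \le \frac{L}{2}\|x_0 - x^*\|^2 + \sum_{i=1}^k \varepsilon_i + \sum_{i=1}^k \Big[\big(\|e_i\| + \sqrt{2L\varepsilon_i}\big) \cdot \|x_i - x^*\|\Big]. \qquad (10)$$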



Eq. (10) has two purposes. The first one is to bound the values of $\|x_i - x^*\|$ using the recursive definition. Once we have a bound on these quantities, we shall be able to bound the function values using only $\|x_0 - x^*\|$ and the values of the errors.



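6.1.1 Bounding $\|x_i - x^*\|$

We now need to bound the quantities $\|x_i - x^*\|$ in terms of $\|x_0 - x^*\|$, $e_i$ and $\varepsilon_i$. Dropping the first term in Eq. (10), which is positive due to the optimality of $f(x^*)$, we have:

$$\|x_k - x^*\|^2 \le \|x_0 - x^*\|^2 + \frac{2}{L}\sum_{i=1}^k \varepsilon_i + 2\sum_{i=1}^k \Big[\Big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\Big) \cdot \|x_i - x^*\|\Big].$$

We now use Lemma 1 (using $S_k = \|x_0 - x^*\|^2 + \frac{2}{L}\sum_{i=1}^k \varepsilon_i$ and $\lambda_i = 2\big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\big)$) to get

$$\|x_k - x^*\| \le \sum_{i=1}^k \Big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\Big) + \Bigg(\|x_0 - x^*\|^2 + \frac{2}{L}\sum_{i=1}^k \varepsilon_i + \Big[\sum_{i=1}^k \Big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\Big)\Big]^2\Bigg)^{1/2}.$$

Denoting $A_k = \sum_{i=1}^k \big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\big)$ and $B_k = \sum_{i=1}^k \frac{\varepsilon_i}{L}$, we get

$$\|x_k - x^*\| \le A_k + \big(\|x_0 - x^*\|^2 + 2B_k + A_k^2\big)^{1/2}.$$

Since $A_i$ and $B_i$ are increasing sequences ($\|e_i\|$ and $\varepsilon_i$ being positive), we have for $i \le k$

$$\begin{aligned}
\|x_i - x^*\| &\le A_i + \big(\|x_0 - x^*\|^2 + 2B_i + A_i^2\big)^{1/2} \\
&\le A_k + \big(\|x_0 - x^*\|^2 + 2B_k + A_k^2\big)^{1/2} \\
&\le A_k + \|x_0 - x^*\| + \sqrt{2B_k} + A_k
\end{aligned}$$

using the positivity of $\|x_0 - x^*\|^2$, $B_k$ and $A_k^2$.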



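6.1.2 Bounding the function values

Now that we have a common bound for all $\|x_i - x^*\|$ with $i \le k$, we can upper-bound the right-hand side of Eq. (10) using only terms depending on $\|x_0 - x^*\|$, $e_i$ and $\varepsilon_i$. Indeed, discarding $\frac{L}{2}\|x_k - x^*\|^2$ which is positive, Eq. (10) becomes

$$\begin{aligned}
\sum_{i=1}^k [f(x_i) - f(x^*)] &\le \frac{L}{2}\|x_0 - x^*\|^2 + L B_k + L A_k \big(A_k + \|x_0 - x^*\| + \sqrt{2B_k} + A_k\big) \\
&\le \frac{L}{2}\|x_0 - x^*\|^2 + L B_k + 2L A_k^2 + L A_k \|x_0 - x^*\| + L A_k \sqrt{2B_k} \\
&\le \frac{L}{2}\big(\|x_0 - x^*\| + 2A_k + \sqrt{2B_k}\big)^2.
\end{aligned}$$

Since $f$ is convex, we get

$$f\Big(\frac{1}{k}\sum_{i=1}^k x_i\Big) - f(x^*) \le \frac{1}{k}\sum_{i=1}^k [f(x_i) - f(x^*)] \le \frac{L}{2k}\big(\|x_0 - x^*\| + 2A_k + \sqrt{2B_k}\big)^2.$$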
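The final bound can be checked numerically on a small problem. The sketch below is an illustration under stated assumptions, not the authors' experimental setup: it takes $g(x) = \frac{1}{2}\sum_i d_i (x_i - b_i)^2$ and $h = \lambda\|x\|_1$ (so $x^*$ is available in closed form), injects gradient errors with $\|e_k\| = 1/k^2$, and uses an exact proximity operator ($\varepsilon_k = 0$, hence $B_k = 0$). All names and constants are hypothetical.

```python
import math
import random

def soft(v, t):
    """Scalar soft-thresholding: prox of t*|.| at v."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def basic_inexact_prox_grad(k_max=60, seed=0):
    """Run the basic method with gradient errors; return (gap, bound) per k."""
    rng = random.Random(seed)
    d = [1.0, 2.0, 5.0, 10.0]     # g(x) = 0.5 * sum_i d_i (x_i - b_i)^2
    b = [1.5, -0.2, 0.8, -2.0]
    lam = 0.5                     # h(x) = lam * ||x||_1
    L = max(d)                    # Lipschitz constant of g'

    def f(x):
        return sum(0.5 * di * (xi - bi) ** 2 + lam * abs(xi)
                   for di, xi, bi in zip(d, x, b))

    x_star = [soft(bi, lam / di) for bi, di in zip(b, d)]  # separable solution
    x0 = [0.0] * len(d)
    dist0 = math.sqrt(sum((u - s) ** 2 for u, s in zip(x0, x_star)))

    x, avg, A, out = x0[:], [0.0] * len(d), 0.0, []
    for k in range(1, k_max + 1):
        raw = [rng.gauss(0.0, 1.0) for _ in d]
        scale = 1.0 / (k * k * math.sqrt(sum(r * r for r in raw)))
        e = [r * scale for r in raw]                       # ||e_k|| = 1/k^2
        grad = [di * (xi - bi) for di, xi, bi in zip(d, x, b)]
        x = [soft(xi - (gi + ei) / L, lam / L)             # exact prox step
             for xi, gi, ei in zip(x, grad, e)]
        avg = [a + (xi - a) / k for a, xi in zip(avg, x)]  # mean of x_1..x_k
        A += 1.0 / (k * k * L)                             # A_k = sum ||e_i|| / L
        gap = f(avg) - f(x_star)
        bound = L / (2.0 * k) * (dist0 + 2.0 * A) ** 2     # B_k = 0 here
        out.append((gap, bound))
    return out
```

Each returned pair compares the suboptimality of the averaged iterate with the right-hand side above; on this toy problem the gap is expected to stay below the bound at every iteration.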



6.2 Accelerated proximal-gradient method with errors in the convex case 

We now give the proof of Proposition 2.

Proof Defining

$$\theta_k = 2/(k + 1), \qquad v_k = x_{k-1} + \frac{1}{\theta_k}(x_k - x_{k-1}),$$

we can rewrite the update for $y_k$ as

$$y_k = (1 - \theta_{k+1}) x_k + \theta_{k+1} v_k,$$

because

$$\begin{aligned}
(1 - \theta_{k+1}) x_k + \theta_{k+1} v_k &= \Big(1 - \frac{2}{k+2}\Big) x_k + \frac{2}{k+2}\Big[x_{k-1} + \frac{k+1}{2}(x_k - x_{k-1})\Big] \\
&= x_k + \frac{k-1}{k+2}(x_k - x_{k-1}) = y_k.
\end{aligned}$$
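The rewriting of the $y_k$ update is a purely algebraic identity, so it can be verified exactly with rational arithmetic. The snippet below is a small illustrative check with scalar iterates; the function names are hypothetical.

```python
from fractions import Fraction

def theta(k):
    """theta_k = 2 / (k + 1) as an exact rational."""
    return Fraction(2, k + 1)

def momentum_identity_holds(k, x_prev, x_curr):
    """Check (1 - theta_{k+1}) x_k + theta_{k+1} v_k == x_k + (k-1)/(k+2) (x_k - x_{k-1})."""
    v_k = x_prev + (x_curr - x_prev) / theta(k)
    lhs = (1 - theta(k + 1)) * x_curr + theta(k + 1) * v_k
    rhs = x_curr + Fraction(k - 1, k + 2) * (x_curr - x_prev)
    return lhs == rhs
```

Because `Fraction` arithmetic is exact, `momentum_identity_holds(k, Fraction(3, 7), Fraction(-5, 2))` should be true for every integer $k \ge 1$ without any tolerance.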

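Because $g'$ is Lipschitz and $g$ is convex, we get for any $z$ that

$$\begin{aligned}
g(x_k) &\le g(y_{k-1}) + \langle g'(y_{k-1}), x_k - y_{k-1} \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 \\
&\le g(z) + \langle g'(y_{k-1}), y_{k-1} - z \rangle + \langle g'(y_{k-1}), x_k - y_{k-1} \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2.
\end{aligned}$$

Because $-[g'(y_{k-1}) + e_k + L(x_k + f_k - y_{k-1})] \in \partial_{\varepsilon_k} h(x_k)$, we have for any $z$ that

$$\begin{aligned}
h(x_k) &\le \varepsilon_k + h(z) + \langle L(y_{k-1} - x_k) - g'(y_{k-1}) - e_k - L f_k, x_k - z \rangle \\
&= \varepsilon_k + h(z) + \langle g'(y_{k-1}), z - x_k \rangle + L \langle x_k - y_{k-1}, z - x_k \rangle + \langle e_k + L f_k, z - x_k \rangle.
\end{aligned}$$

Adding these bounds together gives:

$$f(x_k) = g(x_k) + h(x_k) \le \varepsilon_k + f(z) + L \langle x_k - y_{k-1}, z - x_k \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 + \langle e_k + L f_k, z - x_k \rangle.$$

Choosing $z = \theta_k x^* + (1 - \theta_k) x_{k-1}$ gives

$$\begin{aligned}
f(x_k) &\le \varepsilon_k + f(\theta_k x^* + (1 - \theta_k) x_{k-1}) + L \langle x_k - y_{k-1}, \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 \\
&\quad + \langle e_k + L f_k, \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \rangle \\
&\le \varepsilon_k + \theta_k f(x^*) + (1 - \theta_k) f(x_{k-1}) + L \langle x_k - y_{k-1}, \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 \\
&\quad + \langle e_k + L f_k, \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \rangle \qquad (11)
\end{aligned}$$

using the convexity of $f$ and the fact that $\theta_k$ is in $[0, 1]$.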

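Since

$$\theta_k x^* + (1 - \theta_k) x_{k-1} - x_k = \theta_k (x^* - v_k)$$

and

$$\begin{aligned}
x_k - y_{k-1} &= \theta_k v_k + (1 - \theta_k) x_{k-1} - y_{k-1} \\
&= \theta_k v_k - \theta_k v_{k-1},
\end{aligned}$$

we have

$$\begin{aligned}
L \langle x_k - y_{k-1}, \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \rangle &= L\theta_k^2 \langle v_k - v_{k-1}, x^* - v_k \rangle \\
&= -L\theta_k^2 \|v_k - x^*\|^2 + L\theta_k^2 \langle v_k - x^*, v_{k-1} - x^* \rangle \qquad (12)
\end{aligned}$$

$$\begin{aligned}
\frac{L}{2}\|x_k - y_{k-1}\|^2 &= \frac{L\theta_k^2}{2}\|v_k - v_{k-1}\|^2 \\
&= \frac{L\theta_k^2}{2}\Big(\|v_k - x^*\|^2 + \|v_{k-1} - x^*\|^2 - 2\langle v_k - x^*, v_{k-1} - x^* \rangle\Big) \qquad (13)
\end{aligned}$$

$$\langle e_k + L f_k, \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \rangle = \theta_k \langle e_k + L f_k, x^* - v_k \rangle.$$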



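Summing Eq. (12) and (13), we get

$$L \langle x_k - y_{k-1}, \theta_k x^* + (1 - \theta_k) x_{k-1} - x_k \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 = \frac{L\theta_k^2}{2}\Big(\|v_{k-1} - x^*\|^2 - \|v_k - x^*\|^2\Big).$$

Moving all function values in Eq. (11) to the left-side, we then get

$$f(x_k) - \theta_k f(x^*) - (1 - \theta_k) f(x_{k-1}) \le \frac{L\theta_k^2}{2}\Big(\|v_{k-1} - x^*\|^2 - \|v_k - x^*\|^2\Big) + \varepsilon_k + \theta_k \langle e_k + L f_k, x^* - v_k \rangle.$$

Reordering the terms and dividing by $\theta_k^2$ gives

$$\frac{1}{\theta_k^2}\big(f(x_k) - f(x^*)\big) + \frac{L}{2}\|v_k - x^*\|^2 \le \frac{1 - \theta_k}{\theta_k^2}\big(f(x_{k-1}) - f(x^*)\big) + \frac{L}{2}\|v_{k-1} - x^*\|^2 + \frac{\varepsilon_k}{\theta_k^2} + \frac{1}{\theta_k}\langle e_k + L f_k, x^* - v_k \rangle.$$

Now we use that for all $k$ greater than or equal to 1,

$$\frac{1 - \theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}$$

to apply this recursively and obtain

$$\frac{1}{\theta_k^2}\big(f(x_k) - f(x^*)\big) + \frac{L}{2}\|v_k - x^*\|^2 \le \frac{1 - \theta_0}{\theta_0^2}\big(f(x_0) - f(x^*)\big) + \frac{L}{2}\|v_0 - x^*\|^2 + \sum_{i=1}^k \frac{\varepsilon_i}{\theta_i^2} + \sum_{i=1}^k \frac{1}{\theta_i}\big(\|e_i\| + \sqrt{2L\varepsilon_i}\big) \cdot \|x^* - v_i\|$$

using $\|f_i\| \le \sqrt{2\varepsilon_i/L}$. Since $v_0 = x_0$ and $\theta_0 = 2$, we get

$$\frac{1}{\theta_k^2}\big(f(x_k) - f(x^*)\big) + \frac{L}{2}\|v_k - x^*\|^2 \le \frac{L}{2}\|x_0 - x^*\|^2 + \sum_{i=1}^k \frac{\varepsilon_i}{\theta_i^2} + \sum_{i=1}^k \frac{1}{\theta_i}\big(\|e_i\| + \sqrt{2L\varepsilon_i}\big) \cdot \|x^* - v_i\|. \qquad (14)$$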



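As in the previous proof, we will now use Eq. (14) to first bound the values of $\|v_i - x^*\|$ then, using these bounds, bound the function values.

6.2.1 Bounding $\|v_i - x^*\|$

We now need to bound the quantities $\|v_i - x^*\|$ in terms of $\|x_0 - x^*\|$, $e_i$ and $\varepsilon_i$. From Eq. (14), dropping the nonnegative function-value term, we have

$$\|v_k - x^*\|^2 \le \|x_0 - x^*\|^2 + \frac{2}{L}\sum_{i=1}^k \frac{\varepsilon_i}{\theta_i^2} + 2\sum_{i=1}^k \frac{1}{\theta_i}\Big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\Big) \cdot \|x^* - v_i\|.$$

Since $\theta_i = 2/(i + 1)$, $\frac{1}{\theta_i} = \frac{i+1}{2} \le i$ since $i \ge 1$. Thus, we have

$$\|v_k - x^*\|^2 \le \|x_0 - x^*\|^2 + \frac{2}{L}\sum_{i=1}^k i^2 \varepsilon_i + 2\sum_{i=1}^k i\Big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\Big) \cdot \|x^* - v_i\|.$$

From Lemma 1 (using $S_k = \|x_0 - x^*\|^2 + \frac{2}{L}\sum_{i=1}^k i^2 \varepsilon_i$ and $\lambda_i = 2i\big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\big)$), and denoting $A_k = \sum_{i=1}^k i\big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\big)$ and $B_k = \sum_{i=1}^k \frac{i^2 \varepsilon_i}{L}$, we get

$$\|v_k - x^*\| \le A_k + \big(\|x_0 - x^*\|^2 + 2B_k + A_k^2\big)^{1/2}.$$

Since $A_i$ and $B_i$ are increasing sequences, we also have for $i \le k$:

$$\begin{aligned}
\|v_i - x^*\| &\le A_i + \big(\|x_0 - x^*\|^2 + 2B_i + A_i^2\big)^{1/2} \\
&\le \|x_0 - x^*\| + 2A_i + B_i^{1/2}\sqrt{2} \\
&\le \|x_0 - x^*\| + 2A_k + B_k^{1/2}\sqrt{2}.
\end{aligned}$$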



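6.2.2 Bounding the function values

Dropping $\frac{L}{2}\|v_k - x^*\|^2$ in Eq. (14) (since it is positive), we thus have

$$\begin{aligned}
f(x_k) - f(x^*) &\le \frac{L\theta_k^2}{2}\Big(\|x_0 - x^*\|^2 + 2B_k + 2A_k\big(\|x_0 - x^*\| + 2A_k + \sqrt{2B_k}\big)\Big) \\
&\le \frac{L\theta_k^2}{2}\Big(\|x_0 - x^*\|^2 + 2B_k + 2A_k\|x_0 - x^*\| + 4A_k^2 + 2A_k\sqrt{2B_k}\Big) \\
&\le \frac{L\theta_k^2}{2}\Big(\|x_0 - x^*\| + 2A_k + \sqrt{2B_k}\Big)^2
\end{aligned}$$

and, since $\theta_k = 2/(k+1)$,

$$f(x_k) - f(x^*) \le \frac{2L}{(k+1)^2}\Big(\|x_0 - x^*\| + 2A_k + \sqrt{2B_k}\Big)^2.$$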
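As a sanity check of this accelerated bound, one can run the method on a toy problem with a known solution. The sketch below assumes $g(x) = \frac{1}{2}\sum_i d_i (x_i - b_i)^2$, $h = \lambda\|x\|_1$, gradient errors with $\|e_k\| = 1/k^3$, and an exact proximity operator ($\varepsilon_k = 0$, so $B_k = 0$ and $A_k = \sum_i i\|e_i\|/L$). The problem data and names are illustrative assumptions, not taken from the paper's experiments.

```python
import math
import random

def soft(v, t):
    """Scalar soft-thresholding: prox of t*|.| at v."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def accelerated_inexact_prox_grad(k_max=60, seed=0):
    """Accelerated method with gradient errors; return (gap, bound) per k."""
    rng = random.Random(seed)
    d = [1.0, 2.0, 5.0, 10.0]     # g(x) = 0.5 * sum_i d_i (x_i - b_i)^2
    b = [1.5, -0.2, 0.8, -2.0]
    lam = 0.5                     # h(x) = lam * ||x||_1
    L = max(d)

    def f(x):
        return sum(0.5 * di * (xi - bi) ** 2 + lam * abs(xi)
                   for di, xi, bi in zip(d, x, b))

    x_star = [soft(bi, lam / di) for bi, di in zip(b, d)]
    x0 = [0.0] * len(d)
    dist0 = math.sqrt(sum((u - s) ** 2 for u, s in zip(x0, x_star)))

    x_prev, x, y, A, out = x0[:], x0[:], x0[:], 0.0, []
    for k in range(1, k_max + 1):
        raw = [rng.gauss(0.0, 1.0) for _ in d]
        norm_e = 1.0 / k ** 3                               # ||e_k|| = 1/k^3
        scale = norm_e / math.sqrt(sum(r * r for r in raw))
        e = [r * scale for r in raw]
        grad = [di * (yi - bi) for di, yi, bi in zip(d, y, b)]
        x_prev, x = x, [soft(yi - (gi + ei) / L, lam / L)   # prox step at y_{k-1}
                        for yi, gi, ei in zip(y, grad, e)]
        beta = (k - 1.0) / (k + 2.0)                        # momentum weight
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        A += k * norm_e / L                                 # A_k = sum i*||e_i||/L
        gap = f(x) - f(x_star)
        bound = 2.0 * L / (k + 1) ** 2 * (dist0 + 2.0 * A) ** 2
        out.append((gap, bound))
    return out
```

The faster error decay ($1/k^3$ rather than $1/k^2$) is chosen here so that $A_k$ stays bounded, which mirrors the stricter summability requirement of the accelerated rate.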
6.3 Basic proximal-gradient method with errors in the strongly convex case



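Below is the proof of Proposition 3.

Proof Again, there exists $f_k$ such that $\|f_k\| \le \sqrt{2\varepsilon_k/L}$ and

$$L\Big(x_{k-1} - x_k - \frac{1}{L}\big(g'(x_{k-1}) + e_k\big) - f_k\Big) \in \partial_{\varepsilon_k} h(x_k).$$

Since $x^*$ is optimal, we have that $x^* = \mathrm{prox}_L\big(x^* - \frac{1}{L} g'(x^*)\big)$.
We first separate $f_k$, the error in the proximal, from the rest:

$$\begin{aligned}
\|x_k - x^*\|^2 &= \Big\|\mathrm{prox}_L\Big(x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k\Big) + f_k - \mathrm{prox}_L\Big(x^* - \tfrac{1}{L} g'(x^*)\Big)\Big\|^2 \\
&= \Big\|\mathrm{prox}_L\Big(x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k\Big) - \mathrm{prox}_L\Big(x^* - \tfrac{1}{L} g'(x^*)\Big)\Big\|^2 + \|f_k\|^2 \\
&\quad + 2\Big\langle f_k, \mathrm{prox}_L\Big(x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k\Big) - \mathrm{prox}_L\Big(x^* - \tfrac{1}{L} g'(x^*)\Big)\Big\rangle \\
&\le \Big\|\mathrm{prox}_L\Big(x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k\Big) - \mathrm{prox}_L\Big(x^* - \tfrac{1}{L} g'(x^*)\Big)\Big\|^2 + \frac{2\varepsilon_k}{L} \\
&\quad + 2\sqrt{\frac{2\varepsilon_k}{L}}\,\Big\|\mathrm{prox}_L\Big(x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k\Big) - \mathrm{prox}_L\Big(x^* - \tfrac{1}{L} g'(x^*)\Big)\Big\| \\
&\qquad \text{using Cauchy-Schwarz and } \|f_k\| \le \sqrt{2\varepsilon_k/L} \\
&\le \Big\|x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k - x^* + \tfrac{1}{L} g'(x^*)\Big\|^2 + \frac{2\varepsilon_k}{L} \\
&\quad + 2\sqrt{\frac{2\varepsilon_k}{L}}\,\Big\|x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k - x^* + \tfrac{1}{L} g'(x^*)\Big\| \\
&\qquad \text{using the non-expansiveness of the proximal} \\
&\le \Big\|x_{k-1} - \tfrac{1}{L} g'(x_{k-1}) - \tfrac{1}{L} e_k - x^* + \tfrac{1}{L} g'(x^*)\Big\|^2 + \frac{2\varepsilon_k}{L} \\
&\quad + 2\sqrt{\frac{2\varepsilon_k}{L}}\,\Big(\Big\|x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\| + \frac{\|e_k\|}{L}\Big) \\
&\qquad \text{using the triangle inequality.}
\end{aligned}$$

We continue this computation, but now separating $e_k$, the error in the gradient, from the rest:

$$\begin{aligned}
\|x_k - x^*\|^2 &\le \Big\|x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\|^2 + \frac{\|e_k\|^2}{L^2} - \frac{2}{L}\Big\langle e_k, x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\rangle \\
&\quad + \frac{2\varepsilon_k}{L} + 2\sqrt{\frac{2\varepsilon_k}{L}}\,\Big(\Big\|x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\| + \frac{\|e_k\|}{L}\Big) \\
&\le \Big\|x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\|^2 + \frac{\|e_k\|^2}{L^2} + \frac{2}{L}\|e_k\|\,\Big\|x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\| \\
&\quad + \frac{2\varepsilon_k}{L} + 2\sqrt{\frac{2\varepsilon_k}{L}}\,\Big(\Big\|x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\| + \frac{\|e_k\|}{L}\Big) \\
&\qquad \text{using Cauchy-Schwarz} \\
&= \Big\|x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\|^2 + \frac{\|e_k\|^2}{L^2} + \frac{2\varepsilon_k}{L} + \frac{2}{L}\sqrt{\frac{2\varepsilon_k}{L}}\,\|e_k\| \\
&\quad + 2\Big(\frac{\|e_k\|}{L} + \sqrt{\frac{2\varepsilon_k}{L}}\Big)\Big\|x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\|.
\end{aligned}$$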



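We now need to bound $\big\|x_{k-1} - x^* - \frac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\big\|$ to get the final result. We have:

$$\begin{aligned}
\Big\|x_{k-1} - x^* - \tfrac{1}{L}\big(g'(x_{k-1}) - g'(x^*)\big)\Big\|^2 &= \|x_{k-1} - x^*\|^2 + \frac{1}{L^2}\|g'(x_{k-1}) - g'(x^*)\|^2 - \frac{2}{L}\langle g'(x_{k-1}) - g'(x^*), x_{k-1} - x^* \rangle \\
&\le \|x_{k-1} - x^*\|^2 + \frac{1}{L^2}\|g'(x_{k-1}) - g'(x^*)\|^2 \\
&\quad - \frac{2}{L}\Big(\frac{1}{L + \mu}\|g'(x_{k-1}) - g'(x^*)\|^2 + \frac{L\mu}{L + \mu}\|x_{k-1} - x^*\|^2\Big) \\
&\qquad \text{using theorem 2.1.12 of [30]} \\
&= \Big(1 - \frac{2\mu}{L + \mu}\Big)\|x_{k-1} - x^*\|^2 + \frac{1}{L}\Big(\frac{1}{L} - \frac{2}{L + \mu}\Big)\|g'(x_{k-1}) - g'(x^*)\|^2 \\
&\le \Big(1 - \frac{2\mu}{L + \mu}\Big)\|x_{k-1} - x^*\|^2 + \frac{\mu^2}{L}\Big(\frac{1}{L} - \frac{2}{L + \mu}\Big)\|x_{k-1} - x^*\|^2 \\
&\qquad \text{using the negativity of } \tfrac{1}{L} - \tfrac{2}{L + \mu} \text{ and the strong convexity of } g \\
&= \Big(1 - \frac{\mu}{L}\Big)^2 \|x_{k-1} - x^*\|^2.
\end{aligned}$$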



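Thus

$$\begin{aligned}
\|x_k - x^*\|^2 &\le \Big(1 - \frac{\mu}{L}\Big)^2 \|x_{k-1} - x^*\|^2 + \frac{\|e_k\|^2}{L^2} + \frac{2\varepsilon_k}{L} + \frac{2}{L}\sqrt{\frac{2\varepsilon_k}{L}}\,\|e_k\| \\
&\quad + 2\Big(\frac{\|e_k\|}{L} + \sqrt{\frac{2\varepsilon_k}{L}}\Big)\Big(1 - \frac{\mu}{L}\Big)\|x_{k-1} - x^*\| \\
&= \Big(\Big(1 - \frac{\mu}{L}\Big)\|x_{k-1} - x^*\| + \frac{\|e_k\|}{L} + \sqrt{\frac{2\varepsilon_k}{L}}\Big)^2.
\end{aligned}$$

Taking the square root of both sides and applying the bound recursively yields

$$\|x_k - x^*\| \le \Big(1 - \frac{\mu}{L}\Big)^k \|x_0 - x^*\| + \sum_{i=1}^k \Big(1 - \frac{\mu}{L}\Big)^{k-i}\Big(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\Big).$$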
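The linear-rate bound on $\|x_k - x^*\|$ can be checked in the same spirit. The following sketch assumes a strongly convex diagonal quadratic $g$ (so $\mu = \min_i d_i$ and $L = \max_i d_i$), $h = \lambda\|x\|_1$, gradient errors with $\|e_k\| = 2^{-k}$, and an exact proximity operator ($\varepsilon_k = 0$). The right-hand side is accumulated through the one-step recursion $b_k = (1 - \mu/L)\,b_{k-1} + \|e_k\|/L$, which unrolls to the sum above. All names and constants are illustrative.

```python
import math
import random

def soft(v, t):
    """Scalar soft-thresholding: prox of t*|.| at v."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def strongly_convex_distance_check(k_max=60, seed=0):
    """Return (distance, bound) pairs for the basic method with gradient errors."""
    rng = random.Random(seed)
    d = [1.0, 2.0, 5.0, 10.0]     # g(x) = 0.5 * sum_i d_i (x_i - b_i)^2
    b = [1.5, -0.2, 0.8, -2.0]
    lam = 0.5                     # h(x) = lam * ||x||_1
    mu, L = min(d), max(d)

    x_star = [soft(bi, lam / di) for bi, di in zip(b, d)]
    x = [0.0] * len(d)

    def dist(u):
        return math.sqrt(sum((ui - si) ** 2 for ui, si in zip(u, x_star)))

    bound, out = dist(x), []      # b_0 = ||x_0 - x*||
    for k in range(1, k_max + 1):
        raw = [rng.gauss(0.0, 1.0) for _ in d]
        norm_e = 0.5 ** k                                   # ||e_k|| = 2^{-k}
        scale = norm_e / math.sqrt(sum(r * r for r in raw))
        e = [r * scale for r in raw]
        grad = [di * (xi - bi) for di, xi, bi in zip(d, x, b)]
        x = [soft(xi - (gi + ei) / L, lam / L)
             for xi, gi, ei in zip(x, grad, e)]
        bound = (1.0 - mu / L) * bound + norm_e / L         # unrolls to the sum
        out.append((dist(x), bound))
    return out
```

With geometrically decaying errors, both the distance and the bound should shrink at a linear rate, with the distance staying below the bound at each step.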



6.4 Accelerated proximal-gradient method with errors in the strongly 
convex case 

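We now give the proof of Proposition 4.

Proof We have (following [30])

$$x_k = y_{k-1} - \frac{1}{L} g'(y_{k-1}).$$

We define

$$\begin{aligned}
\alpha_k^2 &= (1 - \alpha_k)\alpha_{k-1}^2 + \frac{\mu}{L}\alpha_k \\
v_k &= x_{k-1} + \frac{1}{\alpha_{k-1}}(x_k - x_{k-1}) \\
\theta_k &= \frac{\alpha_k - \frac{\mu}{L}}{1 - \frac{\mu}{L}} \\
y_k &= x_k + \theta_k (v_k - x_k).
\end{aligned}$$

If we choose $\alpha_0 = \sqrt{\mu/L}$, then this yields

$$y_k = x_k + \frac{1 - \sqrt{\mu/L}}{1 + \sqrt{\mu/L}}(x_k - x_{k-1}).$$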
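The step from the definitions of $\alpha_k$, $\theta_k$ and $v_k$ to the closed-form momentum weight can be checked numerically. The sketch below assumes the constant choice $\alpha_k = \sqrt{\mu/L}$ for all $k$; it verifies that this value satisfies the recursion for $\alpha_k$ and that $\theta_k (1/\alpha_{k-1} - 1)$, the coefficient multiplying $x_k - x_{k-1}$ in $y_k$, matches the closed form. The function name is hypothetical.

```python
import math

def momentum_from_alpha(mu, L):
    """Return (recursion residual, momentum via theta, closed-form momentum)."""
    gamma = mu / L
    alpha = math.sqrt(gamma)
    # Residual of alpha^2 = (1 - alpha) * alpha^2 + gamma * alpha.
    residual = alpha ** 2 - ((1.0 - alpha) * alpha ** 2 + gamma * alpha)
    theta = (alpha - gamma) / (1.0 - gamma)
    # v_k - x_k = (1/alpha - 1) (x_k - x_{k-1}), so y_k = x_k + theta*(1/alpha - 1)*(x_k - x_{k-1}).
    beta_from_theta = theta * (1.0 / alpha - 1.0)
    beta_closed_form = (1.0 - alpha) / (1.0 + alpha)
    return residual, beta_from_theta, beta_closed_form
```

For any $0 < \mu < L$, the residual should vanish and the two momentum values should agree up to floating-point error.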



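We can bound $g(x_k)$ with

$$\begin{aligned}
g(x_k) &\le g(y_{k-1}) + \langle g'(y_{k-1}), x_k - y_{k-1} \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 \\
&\qquad \text{using the L-Lipschitz gradient of } g \\
&\le g(z) + \langle g'(y_{k-1}), y_{k-1} - z \rangle + \langle g'(y_{k-1}), x_k - y_{k-1} \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 - \frac{\mu}{2}\|y_{k-1} - z\|^2 \\
&\qquad \text{using the } \mu\text{-strong convexity of } g.
\end{aligned}$$

Using Lemma 2, we have that $-[g'(y_{k-1}) + e_k + L(x_k + f_k - y_{k-1})] \in \partial_{\varepsilon_k} h(x_k)$. Hence, we have for any $z$ that

$$\begin{aligned}
h(x_k) &\le \varepsilon_k + h(z) + \langle L(y_{k-1} - x_k) - g'(y_{k-1}) - e_k - L f_k, x_k - z \rangle \\
&= \varepsilon_k + h(z) + \langle g'(y_{k-1}), z - x_k \rangle + L \langle x_k - y_{k-1}, z - x_k \rangle + \langle e_k + L f_k, z - x_k \rangle.
\end{aligned}$$

Adding these two bounds, we get for any $z$

$$f(x_k) \le \varepsilon_k + f(z) + L \langle x_k - y_{k-1}, z - x_k \rangle + \frac{L}{2}\|x_k - y_{k-1}\|^2 - \frac{\mu}{2}\|y_{k-1} - z\|^2 + \langle e_k + L f_k, z - x_k \rangle.$$

Using $z = \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1}$, we get

$$\begin{aligned}
f(x_k) &\le \varepsilon_k + f(\alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1}) + L \langle x_k - y_{k-1}, \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k \rangle \\
&\quad + \frac{L}{2}\|x_k - y_{k-1}\|^2 - \frac{\mu}{2}\|y_{k-1} - \alpha_{k-1} x^* - (1 - \alpha_{k-1}) x_{k-1}\|^2 \\
&\quad + \langle e_k + L f_k, \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k \rangle \\
&\le \varepsilon_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) + L \langle x_k - y_{k-1}, \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k \rangle \\
&\quad + \frac{L}{2}\|x_k - y_{k-1}\|^2 - \frac{\mu}{2}\|y_{k-1} - \alpha_{k-1} x^* - (1 - \alpha_{k-1}) x_{k-1}\|^2 - \frac{\mu}{2}\alpha_{k-1}(1 - \alpha_{k-1})\|x^* - x_{k-1}\|^2 \\
&\quad + \langle e_k + L f_k, \alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k \rangle
\end{aligned}$$

using the $\mu$-strong convexity of $f$.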

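We can replace $x_k - y_{k-1}$ using

$$\begin{aligned}
x_k - y_{k-1} &= x_k - x_{k-1} - \theta_{k-1}(v_{k-1} - x_{k-1}) \\
&= \theta_{k-1} x_{k-1} + \frac{\theta_{k-1}}{\alpha_{k-1}}(x_k - x_{k-1}) + \Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big)(x_k - x_{k-1}) - \theta_{k-1} v_{k-1} \\
&= \theta_{k-1}(v_k - v_{k-1}) + \Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big)(x_k - x_{k-1}).
\end{aligned}$$

We also have

$$\begin{aligned}
(1 - \alpha_{k-1}) x_{k-1} - x_k &= -\alpha_{k-1} v_k \\
\alpha_{k-1} x^* + (1 - \alpha_{k-1}) x_{k-1} - x_k &= \alpha_{k-1}(x^* - v_k),
\end{aligned}$$

and

$$\begin{aligned}
y_{k-1} - \alpha_{k-1} x^* - (1 - \alpha_{k-1}) x_{k-1} &= y_{k-1} - \alpha_{k-1}(x^* - v_k) - x_k \\
&= \alpha_{k-1}(v_k - x^*) - \theta_{k-1}(v_k - v_{k-1}) - \Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big)(x_k - x_{k-1}).
\end{aligned}$$

Thus,

$$\begin{aligned}
f(x_k) &\le \varepsilon_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) \\
&\quad - L\theta_{k-1}\alpha_{k-1} \langle v_k - v_{k-1}, v_k - x^* \rangle - L(\alpha_{k-1} - \theta_{k-1}) \langle x_k - x_{k-1}, v_k - x^* \rangle \\
&\quad + \frac{L\theta_{k-1}^2}{2}\|v_k - v_{k-1}\|^2 + \frac{L\big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\big)^2}{2}\|x_k - x_{k-1}\|^2 \\
&\quad + L\theta_{k-1}\Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big) \langle v_k - v_{k-1}, x_k - x_{k-1} \rangle \\
&\quad - \frac{\mu\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 - \frac{\mu\theta_{k-1}^2}{2}\|v_k - v_{k-1}\|^2 - \frac{\mu\big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\big)^2}{2}\|x_k - x_{k-1}\|^2 \\
&\quad + \mu\alpha_{k-1}\theta_{k-1} \langle v_k - x^*, v_k - v_{k-1} \rangle + \mu(\alpha_{k-1} - \theta_{k-1}) \langle v_k - x^*, x_k - x_{k-1} \rangle \\
&\quad - \mu\theta_{k-1}\Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big) \langle v_k - v_{k-1}, x_k - x_{k-1} \rangle - \frac{\mu}{2}\alpha_{k-1}(1 - \alpha_{k-1})\|x^* - x_{k-1}\|^2 \\
&\quad + \alpha_{k-1} \langle e_k + L f_k, x^* - v_k \rangle.
\end{aligned}$$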

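To avoid unnecessary clutter, we shall denote $E_k$ the additional term induced by the errors, i.e.

$$E_k = \varepsilon_k + \alpha_{k-1} \langle e_k + L f_k, x^* - v_k \rangle.$$

Before reordering the terms together, we shall also replace all instances of $v_k - v_{k-1}$ with $(v_k - x^*) - (v_{k-1} - x^*)$.

$$\begin{aligned}
f(x_k) &\le E_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) - L\theta_{k-1}\alpha_{k-1}\|v_k - x^*\|^2 + L\theta_{k-1}\alpha_{k-1} \langle v_{k-1} - x^*, v_k - x^* \rangle \\
&\quad - L(\alpha_{k-1} - \theta_{k-1}) \langle x_k - x_{k-1}, v_k - x^* \rangle \\
&\quad + \frac{(L - \mu)\theta_{k-1}^2}{2}\|v_k - x^*\|^2 + \frac{(L - \mu)\theta_{k-1}^2}{2}\|v_{k-1} - x^*\|^2 - (L - \mu)\theta_{k-1}^2 \langle v_k - x^*, v_{k-1} - x^* \rangle \\
&\quad + \frac{(L - \mu)\big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\big)^2}{2}\|x_k - x_{k-1}\|^2 \\
&\quad + L\theta_{k-1}\Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big) \langle v_k - x^*, x_k - x_{k-1} \rangle - L\theta_{k-1}\Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big) \langle v_{k-1} - x^*, x_k - x_{k-1} \rangle \\
&\quad - \frac{\mu\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 \\
&\quad + \mu\alpha_{k-1}\theta_{k-1}\|v_k - x^*\|^2 - \mu\alpha_{k-1}\theta_{k-1} \langle v_k - x^*, v_{k-1} - x^* \rangle \\
&\quad + \mu(\alpha_{k-1} - \theta_{k-1}) \langle v_k - x^*, x_k - x_{k-1} \rangle \\
&\quad - \mu\theta_{k-1}\Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big) \langle v_k - x^*, x_k - x_{k-1} \rangle + \mu\theta_{k-1}\Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big) \langle v_{k-1} - x^*, x_k - x_{k-1} \rangle \\
&\quad - \frac{\mu}{2}\alpha_{k-1}(1 - \alpha_{k-1})\|x^* - x_{k-1}\|^2.
\end{aligned}$$

With a bit of well-needed cleaning, this becomes

$$\begin{aligned}
f(x_k) &\le E_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) \\
&\quad + \Big(\frac{L - \mu}{2}(\theta_{k-1} - \alpha_{k-1})^2 - \frac{L\alpha_{k-1}^2}{2}\Big)\|v_k - x^*\|^2 \\
&\quad + \frac{(L - \mu)\theta_{k-1}^2}{2}\|v_{k-1} - x^*\|^2 \\
&\quad + \frac{(L - \mu)\big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\big)^2}{2}\|x_k - x_{k-1}\|^2 \\
&\quad + (L - \mu)\theta_{k-1}(\alpha_{k-1} - \theta_{k-1}) \langle v_{k-1} - x^*, v_k - x^* \rangle \\
&\quad + (L - \mu)(\theta_{k-1} - \alpha_{k-1})\Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big) \langle x_k - x_{k-1}, v_k - x^* \rangle \\
&\quad - (L - \mu)\theta_{k-1}\Big(1 - \frac{\theta_{k-1}}{\alpha_{k-1}}\Big) \langle v_{k-1} - x^*, x_k - x_{k-1} \rangle \\
&\quad - \frac{\mu}{2}\alpha_{k-1}(1 - \alpha_{k-1})\|x^* - x_{k-1}\|^2.
\end{aligned}$$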



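We can rewrite $x_k - x_{k-1}$ using

$$\begin{aligned}
x_k - x_{k-1} &= \alpha_{k-1}(v_k - x_{k-1}) \\
&= \alpha_{k-1}(v_k - x^*) - \alpha_{k-1}(x_{k-1} - x^*).
\end{aligned}$$

We may now compute the coefficients for the following terms: $\|v_k - x^*\|^2$, $\|v_{k-1} - x^*\|^2$, $\langle v_{k-1} - x^*, v_k - x^* \rangle$, $\langle x_{k-1} - x^*, v_k - x^* \rangle$, $\langle x_{k-1} - x^*, v_{k-1} - x^* \rangle$ and $\|x^* - x_{k-1}\|^2$.

For $\|v_k - x^*\|^2$, we have

$$\underbrace{\frac{L - \mu}{2}(\theta_{k-1} - \alpha_{k-1})^2 - \frac{L\alpha_{k-1}^2}{2}}_{\|v_k - x^*\|^2 \text{ term}} \underbrace{-\ (L - \mu)(\theta_{k-1} - \alpha_{k-1})^2}_{\langle x_k - x_{k-1}, v_k - x^* \rangle \text{ term}} + \underbrace{\frac{(L - \mu)(\theta_{k-1} - \alpha_{k-1})^2}{2}}_{\|x_k - x_{k-1}\|^2 \text{ term}} = -\frac{L\alpha_{k-1}^2}{2}.$$

For $\|v_{k-1} - x^*\|^2$, there is only one term and we keep $\frac{(L - \mu)\theta_{k-1}^2}{2}$.

For $\langle v_{k-1} - x^*, v_k - x^* \rangle$, we get

$$\underbrace{(L - \mu)\theta_{k-1}(\alpha_{k-1} - \theta_{k-1})}_{\langle v_{k-1} - x^*, v_k - x^* \rangle \text{ term}} - \underbrace{(L - \mu)\theta_{k-1}(\alpha_{k-1} - \theta_{k-1})}_{\langle v_{k-1} - x^*, x_k - x_{k-1} \rangle \text{ term}} = 0.$$

For $\langle x_{k-1} - x^*, v_k - x^* \rangle$, we get

$$\underbrace{(L - \mu)(\theta_{k-1} - \alpha_{k-1})^2}_{\langle x_k - x_{k-1}, v_k - x^* \rangle \text{ term}} - \underbrace{(L - \mu)(\theta_{k-1} - \alpha_{k-1})^2}_{\|x_k - x_{k-1}\|^2 \text{ term}} = 0.$$

For $\langle x_{k-1} - x^*, v_{k-1} - x^* \rangle$, we get

$$\underbrace{(L - \mu)\theta_{k-1}(\alpha_{k-1} - \theta_{k-1})}_{\langle v_{k-1} - x^*, x_k - x_{k-1} \rangle \text{ term}} = (L - \mu)\theta_{k-1}(\alpha_{k-1} - \theta_{k-1}).$$

For $\|x^* - x_{k-1}\|^2$, we get

$$\underbrace{\frac{(L - \mu)(\theta_{k-1} - \alpha_{k-1})^2}{2}}_{\|x_k - x_{k-1}\|^2 \text{ term}} \underbrace{-\ \frac{\mu}{2}\alpha_{k-1}(1 - \alpha_{k-1})}_{\|x^* - x_{k-1}\|^2 \text{ term}} = -\frac{\mu}{2}\theta_{k-1}(1 - \alpha_{k-1}).$$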

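Hence, we have

$$\begin{aligned}
f(x_k) &\le E_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) \\
&\quad - \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 \\
&\quad + \frac{(L - \mu)\theta_{k-1}^2}{2}\|v_{k-1} - x^*\|^2 \\
&\quad + (L - \mu)\theta_{k-1}(\alpha_{k-1} - \theta_{k-1}) \langle v_{k-1} - x^*, x_{k-1} - x^* \rangle \\
&\quad - \frac{\mu}{2}\theta_{k-1}(1 - \alpha_{k-1})\|x_{k-1} - x^*\|^2 \\
&\quad - \frac{\theta_{k-1}(L - \mu)^2(\theta_{k-1} - \alpha_{k-1})^2}{2\mu(1 - \alpha_{k-1})}\|v_{k-1} - x^*\|^2 \\
&\quad + \frac{\theta_{k-1}(L - \mu)^2(\theta_{k-1} - \alpha_{k-1})^2}{2\mu(1 - \alpha_{k-1})}\|v_{k-1} - x^*\|^2,
\end{aligned}$$

the last two lines allowing us to complete the square. We may now factor it to get

$$\begin{aligned}
f(x_k) &\le E_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) \\
&\quad - \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 \\
&\quad + \frac{(L - \mu)\theta_{k-1}^2}{2}\|v_{k-1} - x^*\|^2 \\
&\quad - \frac{\mu\theta_{k-1}(1 - \alpha_{k-1})}{2}\Big\|x_{k-1} - x^* - \frac{(L - \mu)(\alpha_{k-1} - \theta_{k-1})}{\mu(1 - \alpha_{k-1})}(v_{k-1} - x^*)\Big\|^2 \\
&\quad + \frac{\theta_{k-1}(L - \mu)^2(\theta_{k-1} - \alpha_{k-1})^2}{2\mu(1 - \alpha_{k-1})}\|v_{k-1} - x^*\|^2.
\end{aligned}$$

Discarding the term depending on $x_{k-1} - x^*$ and regrouping the terms depending on $\|v_{k-1} - x^*\|^2$, we have

$$\begin{aligned}
f(x_k) &\le E_k + \alpha_{k-1} f(x^*) + (1 - \alpha_{k-1}) f(x_{k-1}) \\
&\quad - \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 \\
&\quad + \frac{(L\alpha_{k-1} - \mu)\alpha_{k-1}}{2}\|v_{k-1} - x^*\|^2.
\end{aligned}$$

Reordering the terms, we have

$$f(x_k) - f(x^*) + \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 \le (1 - \alpha_{k-1})\big(f(x_{k-1}) - f(x^*)\big) + \frac{(L\alpha_{k-1} - \mu)\alpha_{k-1}}{2}\|v_{k-1} - x^*\|^2 + E_k. \qquad (15)$$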



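We can rewrite Eq. (15) as

$$\begin{aligned}
f(x_k) - f(x^*) + \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 &\le (1 - \alpha_{k-1})\Big(f(x_{k-1}) - f(x^*) + \frac{L\alpha_{k-2}^2}{2}\|v_{k-1} - x^*\|^2\Big) \\
&\quad + \frac{L\alpha_{k-1}^2 - \mu\alpha_{k-1} - (1 - \alpha_{k-1})L\alpha_{k-2}^2}{2}\|v_{k-1} - x^*\|^2 + E_k.
\end{aligned}$$

Using $\alpha_{k-1}^2 = (1 - \alpha_{k-1})\alpha_{k-2}^2 + \frac{\mu}{L}\alpha_{k-1}$ (with $\alpha_k = \sqrt{\mu/L}$ for all $k$) and denoting

$$\delta_k = f(x_k) - f(x^*) + \frac{L\alpha_{k-1}^2}{2}\|v_k - x^*\|^2 = f(x_k) - f(x^*) + \frac{\mu}{2}\|v_k - x^*\|^2, \qquad (16)$$

we get the following recursion:

$$\delta_k \le \Big(1 - \sqrt{\frac{\mu}{L}}\Big)\delta_{k-1} + E_k. \qquad (17)$$

Applying this relationship recursively, we get

$$\delta_k \le \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^k \delta_0 + \sum_{t=1}^k \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{k-t} E_t. \qquad (18)$$

Since $E_k = \varepsilon_k + \alpha_{k-1} \langle e_k + L f_k, x^* - v_k \rangle$, we can bound it by

$$E_k \le \varepsilon_k + \sqrt{\frac{\mu}{L}}\big(\|e_k\| + \sqrt{2L\varepsilon_k}\big) \cdot \|x^* - v_k\| \qquad (19)$$

using $\|f_k\| \le \sqrt{2\varepsilon_k/L}$. Plugging Eq. (19) into Eq. (18), we get

$$\delta_k \le \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^k \delta_0 + \sum_{t=1}^k \varepsilon_t \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{k-t} + \sqrt{\frac{\mu}{L}}\sum_{t=1}^k \big(\|e_t\| + \sqrt{2L\varepsilon_t}\big)\Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{k-t} \cdot \|x^* - v_t\|. \qquad (20)$$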



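Again, we shall use Eq. (18) to first bound the values of $\|v_i - x^*\|$, then the function values themselves.

6.4.1 Bounding $\|v_i - x^*\|$

We will now use Lemma 1 to bound the value of $\|v_k - x^*\|$. Since $\|v_k - x^*\|^2$ is bounded by $\frac{2}{\mu}\delta_k$ (using Eq. (16)), we can use Eq. (18) to get

$$\|v_k - x^*\|^2 \le \frac{2}{\mu}\Big(1 - \sqrt{\frac{\mu}{L}}\Big)^k \Bigg(\delta_0 + \sum_{t=1}^k \varepsilon_t \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-t} + \sqrt{\frac{\mu}{L}}\sum_{t=1}^k \big(\|e_t\| + \sqrt{2L\varepsilon_t}\big)\Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-t} \cdot \|v_t - x^*\|\Bigg).$$

Multiplying both sides by $\big(1 - \sqrt{\mu/L}\big)^{-k}$ yields

$$\Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-k}\|v_k - x^*\|^2 \le \frac{2}{\mu}\Bigg(\delta_0 + \sum_{t=1}^k \varepsilon_t \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-t} + \sqrt{\frac{\mu}{L}}\sum_{t=1}^k \big(\|e_t\| + \sqrt{2L\varepsilon_t}\big)\Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-t/2} \cdot \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-t/2}\|v_t - x^*\|\Bigg). \qquad (21)$$

Using Lemma 1 with $S_k = \frac{2}{\mu}\Big(\delta_0 + \sum_{t=1}^k \varepsilon_t \big(1 - \sqrt{\mu/L}\big)^{-t}\Big)$ and $\lambda_i = \frac{2}{\sqrt{L\mu}}\big(\|e_i\| + \sqrt{2L\varepsilon_i}\big)\big(1 - \sqrt{\mu/L}\big)^{-i/2}$, we get

$$\Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-k/2}\|v_k - x^*\| \le \frac{1}{2}\sum_{t=1}^k \lambda_t + \Bigg(\frac{2}{\mu}(\delta_0 + B_k) + \Big(\frac{1}{2}\sum_{t=1}^k \lambda_t\Big)^2\Bigg)^{1/2} \qquad (22)$$

with $B_k = \sum_{t=1}^k \varepsilon_t \big(1 - \sqrt{\mu/L}\big)^{-t}$.

Since $B_t$ is an increasing sequence and the $\lambda_t$ are positive, we have for $i \le k$

$$\begin{aligned}
\Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-i/2}\|v_i - x^*\| &\le \frac{1}{2}\sum_{t=1}^i \lambda_t + \Bigg(\frac{2}{\mu}\delta_0 + \frac{2}{\mu}B_i + \frac{1}{4}\Big(\sum_{t=1}^i \lambda_t\Big)^2\Bigg)^{1/2} \\
&\le \frac{1}{2}\sum_{t=1}^i \lambda_t + \sqrt{\frac{2\delta_0}{\mu}} + \sqrt{\frac{2B_i}{\mu}} + \frac{1}{2}\sum_{t=1}^i \lambda_t \\
&\le \sum_{t=1}^k \lambda_t + \sqrt{\frac{2\delta_0}{\mu}} + \sqrt{\frac{2B_k}{\mu}}.
\end{aligned}$$

Hence,

$$\|v_i - x^*\| \le \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{i/2}\Bigg(\sum_{t=1}^k \lambda_t + \sqrt{\frac{2\delta_0}{\mu}} + \sqrt{\frac{2B_k}{\mu}}\Bigg). \qquad (23)$$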



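6.4.2 Bounding the function values

Denoting $A_k = \frac{\sqrt{\mu L}}{2}\sum_{t=1}^k \lambda_t = \sum_{t=1}^k \big(\|e_t\| + \sqrt{2L\varepsilon_t}\big)\big(1 - \sqrt{\mu/L}\big)^{-t/2}$, we obtain after plugging Eq. (23) into Eq. (20)

$$\begin{aligned}
\delta_k &\le \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^k \Bigg(\delta_0 + \sum_{t=1}^k \varepsilon_t \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-t} + \sqrt{\frac{\mu}{L}}\sum_{t=1}^k \big(\|e_t\| + \sqrt{2L\varepsilon_t}\big)\Big(1 - \sqrt{\frac{\mu}{L}}\Big)^{-t/2}\Big(\frac{2A_k}{\sqrt{\mu L}} + \sqrt{\frac{2\delta_0}{\mu}} + \sqrt{\frac{2B_k}{\mu}}\Big)\Bigg) \\
&= \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^k \Bigg(\delta_0 + B_k + \sqrt{\frac{\mu}{L}}\,A_k\Big(\frac{2A_k}{\sqrt{\mu L}} + \sqrt{\frac{2\delta_0}{\mu}} + \sqrt{\frac{2B_k}{\mu}}\Big)\Bigg) \\
&= \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^k \Bigg(\delta_0 + B_k + \frac{2A_k^2}{L} + A_k\sqrt{\frac{2\delta_0}{L}} + A_k\sqrt{\frac{2B_k}{L}}\Bigg) \\
&\le \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^k \Bigg(\sqrt{\delta_0} + A_k\sqrt{\frac{2}{\mu}} + \sqrt{B_k}\Bigg)^2,
\end{aligned}$$

where the last step uses $\mu \le L$.

Using the $\mu$-strong convexity of $f$ and the fact that $v_0 = x_0$, we have

$$\sqrt{\delta_0} = \sqrt{f(x_0) - f(x^*) + \frac{\mu}{2}\|v_0 - x^*\|^2} \le \sqrt{2\big(f(x_0) - f(x^*)\big)}.$$

Hence, discarding the term $\frac{\mu}{2}\|v_k - x^*\|^2$ of $\delta_k$, we have

$$f(x_k) - f(x^*) \le \Big(1 - \sqrt{\frac{\mu}{L}}\Big)^k \Bigg(\sqrt{2\big(f(x_0) - f(x^*)\big)} + A_k\sqrt{\frac{2}{\mu}} + \sqrt{B_k}\Bigg)^2. \qquad (24)$$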
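Eq. (24) can also be probed numerically on a small strongly convex problem. The sketch below is an illustration only: it assumes $g(x) = \frac{1}{2}\sum_i d_i (x_i - b_i)^2$ with $\mu = \min_i d_i$ and $L = \max_i d_i$, $h = \lambda\|x\|_1$, gradient errors with $\|e_k\| = 2^{-k}$, and an exact proximity operator ($\varepsilon_k = 0$, so $B_k = 0$ and $A_k = \sum_t \|e_t\|(1 - \sqrt{\mu/L})^{-t/2}$). Names and constants are hypothetical.

```python
import math
import random

def soft(v, t):
    """Scalar soft-thresholding: prox of t*|.| at v."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def accelerated_strongly_convex(k_max=40, seed=0):
    """Accelerated method with constant momentum; return (gap, bound) per k."""
    rng = random.Random(seed)
    d = [1.0, 2.0, 5.0, 10.0]     # g(x) = 0.5 * sum_i d_i (x_i - b_i)^2
    b = [1.5, -0.2, 0.8, -2.0]
    lam = 0.5                     # h(x) = lam * ||x||_1
    mu, L = min(d), max(d)
    r = math.sqrt(mu / L)
    beta = (1.0 - r) / (1.0 + r)  # constant momentum weight

    def f(x):
        return sum(0.5 * di * (xi - bi) ** 2 + lam * abs(xi)
                   for di, xi, bi in zip(d, x, b))

    x_star = [soft(bi, lam / di) for bi, di in zip(b, d)]
    x0 = [0.0] * len(d)
    gap0 = f(x0) - f(x_star)

    x_prev, x, y, A, out = x0[:], x0[:], x0[:], 0.0, []
    for k in range(1, k_max + 1):
        raw = [rng.gauss(0.0, 1.0) for _ in d]
        norm_e = 0.5 ** k                                   # ||e_k|| = 2^{-k}
        scale = norm_e / math.sqrt(sum(r_ * r_ for r_ in raw))
        e = [r_ * scale for r_ in raw]
        grad = [di * (yi - bi) for di, yi, bi in zip(d, y, b)]
        x_prev, x = x, [soft(yi - (gi + ei) / L, lam / L)   # prox step at y_{k-1}
                        for yi, gi, ei in zip(y, grad, e)]
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        A += norm_e * (1.0 - r) ** (-k / 2.0)               # A_k with eps_t = 0
        gap = f(x) - f(x_star)
        bound = (1.0 - r) ** k * (math.sqrt(2.0 * gap0)
                                  + A * math.sqrt(2.0 / mu)) ** 2
        out.append((gap, bound))
    return out
```

On this toy instance the suboptimality is expected to remain under the right-hand side of Eq. (24) at every iteration; the geometric error decay keeps $A_k$ bounded so the overall linear rate is preserved.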